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Summary.  Network  data  often  take  the  form  of  repeated  interactions  between  senders  and  re¬ 
ceivers  tabulated  over  time.  A  primary  question  to  ask  of  such  data  is  which  traits  and  behaviors 
are  predictive  of  interaction.  To  answer  this  question,  a  model  Is  introduced  for  treating  directed 
interactions  as  a  multivariate  point  process:  a  Cox  multiplicative  intensity  model  using  covarlates 
that  depend  on  the  history  of  the  process.  Consistency  and  asymptotic  normality  are  proved  for 
the  resulting  partlal-llkelihood-based  estimators  under  suitable  regularity  conditions,  and  an  efficient 
fitting  procedure  Is  described.  Multicast  Interactions — those  involving  a  single  sender  but  multiple 
receivers — are  treated  explicitly.  The  resulting  Inferential  framework  is  then  employed  to  model  mes¬ 
sage  sending  behavior  in  a  corporate  e-mail  network.  The  analysis  gives  a  precise  quantification  of 
which  static  shared  traits  and  dynamic  network  effects  are  predictive  of  message  recipient  selection. 

Keywords:  Cox  proportional  hazards  model;  Network  data  analysis;  Partial  likelihood  infer¬ 
ence;  Point  processes 

1.  Introduction 

Much  effort  has  been  devoted  to  the  statistical  analysis  of  network  data;  see  Jackson  (2008), 
Goldenberg  et  al.  (2009),  and  Kolaczyk  (2009)  for  recent  overviews.  Often  network  observables 
comprise  counts  of  interactions  between  individuals  or  groups  tabulated  over  time.  Communica¬ 
tions  networks  give  rise  to  directed  interactions:  phone  calls,  text  messages,  or  e-mails  exchanged 
amongst  a  given  set  of  individuals  over  a  specific  time  period  (Tyler  et  ah,  2005;  Eagle  and  Pent- 
land,  2006).  Specific  examples  of  repeated  interactions  from  other  types  of  networks  include  the 
following:  Fowler’s  (2006)  study  of  legislators  authoring  and  cosponsoring  bills  (a  collaboration 
network);  Mckenzie  and  Rapoport’s  (2007)  study  of  families  migrating  between  communities  in 
Mexico  (a  migration  network);  Sundaresan,  Fischoff,  Dushoff,  and  Rubenstein’s  (2007)  study  of 
zebras  congregating  at  locations  in  their  habitat  (an  animal  association  network) ;  and  Papachris- 
tos’s  (2009)  study  of  gangs  in  Chicago  murdering  members  of  rival  factions  (an  organized  crime 
network) . 

In  this  article,  we  consider  partial-likelihood-based  inference  for  general  directed  interaction 
data  in  the  presence  of  covariates.  We  first  develop  asymptotic  theory  for  the  case  in  which  inter¬ 
actions  are  strictly  pairwise,  and  then  generalize  our  results  to  the  multiple-receiver  (multicast) 
case;  we  also  provide  efficient  algorithms  for  partial  likelihood  maximization  in  these  settings. 
Our  main  assumption  on  the  covariates  is  that  they  be  predictable,  which  allows  them  to  vary 
with  time  and  potentially  depend  on  the  past. 

Address  for  correspondence:  Patrick  O.  Perry,  Information,  Operations,  and  Management  Sciences 
Department,  Stern  School  of  Business,  New  York  University,  44  West  4th  St,  New  York,  NY  10012,  USA 
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The  interaction  data  we  consider  comprise  a  set  of  triples,  with  triple  indicating  that 

at  time  t,  directed  interaction  i  ^  j  took  place — for  instance,  individual  i  sent  a  message  to 
individual  j.  Given  such  a  set  of  triples,  a  primary  modeling  goal  lies  in  determining  which 
characteristics  and  behaviors  of  the  senders  and  receivers  are  predictive  of  interaction.  In  this 
vein,  three  important  questions  stand  out: 

Homophily  Is  there  evidence  of  homophily  (an  increased  rate  of  interaction  among  similar 
individuals)?  To  what  degree  is  a  shared  attribute  predictive  of  heightened  interaction? 

Network  Effects  To  what  extent  are  past  interaction  behaviors  predictive  of  future  ones?  If 
we  observe  interactions  i  — >  /i  and  — )■  j,  are  we  more  likely  to  see  the  interaction  I  — )■  j? 

Multiplicity  How  should  multiple-receiver  interactions  of  the  type  i  — >■  {ji,j2,  ■  ■  ■ ,  ji}  be  mod¬ 
eled?  What  are  the  implications  of  treating  these  as  L  separate  pairwise  interactions? 

The  issues  of  homophily,  network  effects,  and  their  interactions  arise  frequently  in  the  net¬ 
works  literature;  see,  e.g.,  McPherson  et  al.  (2001);  Butts  (2008);  Aral  et  al.  (2009);  Snijders  et  al. 
(2010),  and  references  contained  therein.  Multiplicity  has  largely  been  ignored  in  this  context, 
however,  with  notable  exceptions  including  Lunagomez  et  al.  (2009)  for  graphical  models,  and 
Shafiei  and  Chipman  (2010)  for  network  modeling. 

In  the  remainder  of  this  article,  we  provide  a  modeling  framework  and  computationally  effi¬ 
cient  partial  likelihood  inference  procedures  to  facilitate  analysis  of  these  questions.  We  employ 
a  Cox  proportional  intensity  model  incorporating  both  static  and  history-dependent  covariates 
to  address  the  first  of  these  two  questions,  and  a  parametric  bootstrap  to  address  the  third.  Sec¬ 
tion  2  presents  our  point  process  model  for  directed  pairwise  interactions,  along  with  the  resultant 
inference  procedures.  Section  3  proves  consistency  and  asymptotic  normality  of  the  correspond¬ 
ing  maximum  partial  likelihood  estimator,  and  Section  4  extends  our  framework  to  the  case  of 
multiple-receiver  interactions.  Section  5  employs  this  framework  to  model  message  sending  be¬ 
havior  in  a  corporate  e-mail  network.  Section  6  evaluates  the  strength  of  homophily  and  network 
effects  in  explaining  these  data,  and  discusses  the  results  of  three  comparative  analyses.  Section  7 
concludes  the  main  body  of  the  article,  and  Appendices  A-C  contain  implementation  details  and 
technical  lemmas. 

2.  A  point  process  model  and  partial  likelihood  inference 

Every  interaction  process  can  be  encoded  by  a  multivariate  counting  measure.  For  sender  i, 
receiver  j,  and  positive  time  t,  define 

^t{hj)  =  #{directed  interactions  f  — >•  j  in  time  interval  [0,t]}. 

For  technical  reasons,  assume  that  No{i,j)  =  0  and  that  Nt{i,j)  is  adapted  to  a  stochastic  basis  of 
cr-algebras  {J-t}t>o  satisfying  the  usual  conditions.  Then,  Nt{i,j)  is  a  local  submartingale,  so  by 
the  Doob-Meyer  decomposition,  there  exists  a  predictable  increasing  process  At{i,j),  null  at  zero, 
such  that  Nt(i,  j)  —  At{i,j)  is  an  Jq-local  martingale.  Under  mild  conditions — the  most  important 
of  which  is  that  no  two  interactions  happen  simultaneously — there  exists  a  predictable  continuous 
process  Xt{i,j)  such  that  At{i,j)  =  fg  As(i,j)  ds.  (In  practical  applications,  simultaneous  events 
exist  and  are  an  annoyance;  Efron  (1977)  handles  simultaneity  through  an  ad-hoc  adjustment, 
while  Brostrom  (2002)  adds  a  discrete  component  to  A.)  The  process  A  is  known  as  the  stochastic 
intensity  of  N.  Heuristically, 


dt  =  Pjinteraction  z  — )■  j  occurs  in  time  interval  [t,  t  +  dt)}. 
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We  will  model  N  through  A  using  a  version  of  the  Cox  (1972)  proportional  intensity  model. 

Let  I  be  a  set  of  senders  and  77  be  a  (not  necessarily  disjoint)  set  of  receivers.  For  each  sender 
z,  let  At(z)  be  a  non-negative  predictable  process  called  the  baseline  intensity  of  sender  z;  let  Jz(z) 
be  a  predictable  finite  subset  of  77  called  the  receiver  set  of  sender  i.  For  each  sender-receiver 
pair  (z,  j),  let  Xt{i,j)  be  a  predictable  locally  bounded  vector  of  covariates  in  Rp.  Let  /3o  be 
an  unknown  vector  of  coefficients  in  For  the  remainder  of  this  section,  assume  that  each 
interaction  has  a  single  receiver. 

Given  a  multivariate  counting  process  N  on  IR_|_  x  Z  x  77,  we  model  its  stochastic  intensity  as 

Ai(Li)  =  At(z)  •  ex.p{l3'^ xtii,  j)}  ■  l{j  €  (1) 

This  model  posits  that  sender  z  in  Z  interacts  with  receiver  j  in  77t(z)  at  a  baseline  rate  \t{i) 
modulated  up  or  down  according  to  the  pair’s  covariate  vector,  xt{i,j).  As  Efron  (1977)  notes, 
the  specific  parametric  form  for  the  multiplier  exp{/3(]’a;t(z,  j)}  is  not  central  to  the  theoreti¬ 
cal  analysis,  but  this  choice  is  amenable  to  computation  and  gives  the  parameter  vector  /3q  a 
straightforward  interpretation.  Butts  (2008)  used  a  variant  of  this  model  to  analyze  repeated 
directed  actions  within  social  settings. 

The  form  of  (1)  is  deceptively  simple  but  remains  flexible  enough  to  be  useful  in  practice. 
The  model  allows  for  homophily  and  group  level  effects  via  inclusion  of  covariates  of  the  form 
“l{z  and  j  belong  to  the  same  group},”  where  “group”  is  some  observable  trait  like  ethnicity, 
gender,  or  age  group.  Its  real  strength,  though,  is  that  Xt{i,j)  is  allowed  to  be  any  predictable 
process,  in  particular  Xt{i,j)  can  depend  on  the  history  of  interactions.  To  model  reciprocation 
and  transitivity  in  the  interactions  (with  Z  =  77),  for  example,  choose  appropriate  values  for  A^, 
and  include  relevant  covariates  in  Xt(i,j): 

Ijinteraction  j  —r  i  occurred  in  [t  —  A^,  t)} 


and 

Ijfor  some  h,  interactions  i  h  and  h  ^  j  occurred  in  [t  —  A^,,  t)}. 

Any  process  measurable  with  respect  to  the  predictable  cr-algebra  is  a  valid  covariate;  this  ex¬ 
cludes  only  covariates  depending  on  the  future  or  the  immediate  present.  In  Section  5.2  we  detail 
specific  covariates  suitable  for  measuring  homophily  and  network  effects. 

Also  note  that  despite  presuming  Z  and  77  to  be  hxed,  our  analysis  allows  senders  and 
receivers  to  enter  and  leave  the  study  during  the  observation  period.  The  effective  number  of 
senders  at  time  t  is  the  set  of  z  such  that  At(z)  ^  0,  which  potentially  varies  with  time.  Likewise, 
the  effective  number  of  receivers  is  controlled  through 

Following  Cox  (1975),  we  treat  the  baseline  rate  Xt{i)  as  a  nuisance  parameter  and  estimate 
the  coefficient  vector  /Iq  using  a  partial  likelihood.  Specifically,  let  ji),  ■  ■  ■  jn)  be 

the  sequence  of  observed  interactions.  The  inference  procedure  is  motivated  by  decomposing  the 
full  likelihood,  L,  as 


,  .  .  .  ,  tfi-)  ini  jn') 

=  L{t2,i2\tiiiiji)  ^(j2|t2, ^2, Ji) 

•  1  in  l^n— 1 1  in—li  jn—  li  •  •  •  5  5  1  in  1  i'n—  Ijin—l  •  • 

—  ^{^11  il')  Lr(t2i  i2\tli  ill  jl')  '  '  '  1  in  |^n— 1 1  in—h  jn—  Ij  ■  ■  •  ili  jl) 

-^ij2\i‘2ii2ii^liili  jl^  '  '  '  ^ijn  l^n  lini  ^n—  hin—l  ■  •  -  ) 
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the  term  comprised  of  the  product  of  conditional  likelihoods  of  ji , . . . ,  is  dubbed  a  partial 
likelihood.  In  continuous  time,  the  log  partial  likelihood  at  time  t,  evaluated  at  /3,  is 

log  PLtiP)  =  I P'^xt  Jim,  jni) -'^og[  exp{/3'^xtji,„,j)}]|.  (2) 

In  Section  3,  we  prove  under  suitable  regularity  conditions  that  the  maximizer  of  log  PLt{-)  is  a 
consistent  estimator  of  /?o  as  t  increases. 

The  function  log  PLJ)  is  concave,  and  so  can  be  maximized  via  Newton’s  method  or  a 
gradient-based  optimization  approach  (Nocedal  and  Wright,  2006).  These  methods  require  one 
or  both  of  the  first  two  derivatives  of  logPLj),  which  can  be  expressed  in  terms  of  weighted 
means  and  covariances  of  the  covariates.  The  weights  are 

wt{l3,i,j)  =  exp{^'^a;t(z,  j)}  ’  Jj  G  Jt{i)},  (3a) 

Wt{l3,i)=  Y  ^tiP,i,j)-  (3b) 

The  inner  sum  in  logPLt{(3)  is  Wtjj3,im)-  The  function  logITt(-,z)  has  gradient  Et{-,i)  and 
Hessian  Vt{-,i),  given  by 


WtW,i) 


VtW,i)  = 


1 


WtiJi) 


Y  ^t{i,j)  -  Et{/3,i) 


1®2 


(4a) 

(4b) 


where  =  a®  a  =  aJ- .  Consequently,  the  gradient  and  negative  Hessian  of  log  PLJ)  are 

C/t(/3)  =  V[logPTt(/3)]  =  Y  Xt^{im,jm)  -  Et^{l3,im),  (5a) 

tm<t 

hj)  =  -V2[logPLt(/3)]  =  ^  Vt^iJim).  (5b) 


tm<t 


We  call  Utjo)  the  unnormalized  score  and  ItiPo)  the  observed  information  matrix. 

Note  the  dependence  of  these  terms  on  time-varying  covariates,  which  precludes  using  suf¬ 
ficient  statistics  and  introduces  additional  complexity  in  maximizing  log  PLt{-).  For  most  large 
interaction  datasets,  existing  computational  routines  for  handling  Cox  models  (e.g.,  the  function 
coxph  from  the  survival  package  for  R  (Therneau  and  Lumley,  2009))  will  not  suffice.  In  Ap¬ 
pendix  A,  we  describe  a  customized  method  for  maximizing  log  PLt{-)  that  exploits  sparsity  in 

XtiiJ)- 


3.  Consistency  of  maximum  partiai  iikeiihood  inference 

Under  the  model  of  Section  2,  the  maximum  partial  likelihood  estimator  (MPLE)  is  a  natural 
estimate  of  /?o;  the  inverse  Hessian  of  logPTt(-)  evaluated  at  the  MPLE  is  a  natural  estimate  of 
its  covariance  matrix.  We  now  give  conditions  under  which  these  estimators  are  consistent. 

In  the  sampling  regime  where  observation  time  t  is  fixed  and  the  number  of  senders  \T\ 
increases,  Andersen  and  Gill’s  (1982)  consistency  proof  for  the  Cox  proportional  hazards  model  in 
survival  analysis  extends  to  cover  model  (1).  This  setting  is  natural  in  the  context  of  clinical  trial 
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data,  where  X  corresponds  to  the  set  of  patients  under  study,  but  does  not  meet  the  requirements 
typical  of  interaction  data.  For  most  interaction  data  we  cannot  control  X  and  and  the 
only  way  to  collect  more  data  is  to  increase  the  observation  time.  Cox  (1972,  1975)  outlines 
a  proof  for  general  MPLE  consistency  that  applies  to  our  sampling  regime,  but  his  argument 
is  heuristic;  Wong’s  (1986)  treatment  is  more  rigorous  but  does  not  cover  continuous  or  time- 
varying  covariates.  The  general  interaction  data  sampling  regime  warrants  a  new  consistency 
proof. 

Our  proof  of  consistency  relies  on  rescaling  time  to  make  the  interaction  times  uniform.  To 
this  end,  define  marginal  processes  Nt{i)  =  J2jej  note  that 

tn  =  sup{t  :  Nt  <  n}  is  &  stopping  time  and  let  X)^  be  the  cr-algebra  of  events  prior  to  The 

main  idea  of  the  proof  is  to  change  time  from  the  original  scale  to  a  scale  on  which  is 

proportional  to  n  —  n' . 


3.1.  Assumptions 

Let  be  a  neighborhood  of  /So-  For  a  vector,  a,  let  ||a|j  denote  its  Euclidean  norm;  for  a  matrix, 
A,  let  |jA||  denote  its  spectral  norm,  equal  to  the  largest  eigenvalue  of  We  require  the 

following  assumptions: 

Al.  The  covariates  are  uniformly  square-integrable.  That  is. 


E 


sup||a;t(i,j)f 


is  bounded. 


A2.  The  integrated  covariance  function  is  well-behaved.  When  j3  &  B  and  a  G  [0,1],  as 
n  — >  oo,  then  with  respect  to  the  covariance  function  Sq,(/3)  we  have  that 

^  iei 


A3.  The  interaction  arrival  times  are  finite.  For  each  n. 


P{t„  <  oo}  =  1. 


A4.  The  variance  fnnction  is  equicontinuons.  More  precisely, 

|Vt„(-,z)  :  n  >  l,i  G  x|  is  an  equicontinuons  family  of  functions. 

These  technical  assumptions  are  similar  to  those  of  Andersen  and  Gill  (1982),  who  investigate 
specific  settings  in  which  their  assumptions  hold.  Note  that  when  ||a;t(z,j)||  is  bounded  and 
Assumption  A3  is  in  force,  the  remaining  assumptions  follow. 

3.2.  Main  results 

Assumptions  A1-A4  imply  that  the  MPLE  is  consistent  and  asymptotically  Gaussian,  as  shown 
by  the  following  two  theorems. 

Theorem  3.1.  Let  N  be  a  multivariate  counting  process  with  stochastic  intensity  as  given 
in  (1),  with  true  parameter  vector  Pq.  Let  tn  be  the  sequence  of  interaction  times,  and  set  UpP) 
and  It{P)  to  be  the  gradient  and  negative  Hessian  of  the  log  partial  likelihood  function  as  given 
respectively  in  (5a)  and  (5b).  If  assumptions  A1-A2  hold,  then  as  n  ^  oo; 
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(a)  ^iLanj  iPo)  converges  weakly  to  a  Gaussian  proeess  on  [0, 1]  with  eovariance  function 

s„(/3o); 

(b)  if  assumptions  A3-A4  also  hold,  then  for  any  consistent  estimator  fin  of  jio,  we  have  that 


sup 

ae[0,l] 


n^t\a'n.\iPn)  5]q,(/3o) 


0. 


We  don’t  actually  require  convergence  of  the  whole  sample  path,  but  it  turns  out  to  be  just  as 
much  effort  to  prove  as  convergence  of  the  endpoint. 

Consistency  is  a  direct  consequence  of  Theorem  3.1,  and  is  proved  in  Appendix  B. 

Theorem  3.2.  Let  N  be  a  multivariate  counting  process  with  stochastic  intensity  as  given 
in  (1),  with  true  parameter  vector  fio.  Let  the  log  partial  likelihood,  \ogPLt{-),  be  as  defined 
in  (2).  Let  tn  be  the  sequence  of  interaction  times. 

Assume  that  for  fi  in  a  neighborhood  offio  t/iat  —  ^V^[logPLt^(/3)]  Si(/3),  where  Si(-)  is  lo¬ 

cally  Lipschitz  and  with  smallest  eigenvalue  bounded  away  from  zero.  If  Pn  maximizes  log  PLt„{-) 
and  conclusion  (a)  of  Theorem  3.1  holds,  then  the  following  are  true  as  n  ^  oo: 

(a)  Pn  is  a  eonsistent  estimator  of  Pq; 

(b)  \/n{Pn  —  Po)  converges  weakly  to  a  mean-zero  Gaussian  random  variable  with  covariance 

Pi(/3o)]-^ 

Proof  (Theorem  3.1).  Observethattheprocess  j)  hascompensator  At(i,j)  =  f*  Ag(i,j)  ds; 
similarly,  processes  lVt(i)  and  Nt  have  compensators  At(i)  =  Jf^j^jAt(i,j)  and  At  = 

Correspondingly,  define  local  martingales  Mt{i,j)  =  Nt{i,j)  —  At{i,j),  Mt{i)  =  Nt{i)  —  At{i), 
and  Mt  =  Nt  —  At;  also  define 


Ht{i,j)  =  xt{i,j)  -  Et{Po,i), 


where  Et{P,i)  is  as  defined  in  (4a). 

As  observed  by  Andersen  and  Gill  (1982),  the  score  function  Ut{-)  evaluated  at  Po  has  a 
simple  representation  in  terms  of  these  processes: 


Hs{i,j)  dNs{i,j) 


Hs{i,j)dMs{i,j), 


since  Jq  Hs{i,j)  dAs{i,j)  =  0.  Since  by  Assumption  Al,  x  is  uniformly  bounded,  H  is  as 

well.  Each  term  in  the  sum  above  is  thus  locally  square  integrable,  with  predictable  covariation 


Hs{i',j')  dMs{i',f) 


=  [  Hs{i,j)®Hs{i,j')d{M{i,j),M(i',j'))^ 

Jo 

=  [  [Hsii,j)]^^dAs{i,j)-l{i  =  i',j=f} 

Jo 


Hs{i,j)dMs{i,j) 
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(Fleming  and  Harrington,  1991,  Thm.  2.4.3).  There  exists  a  sequence  of  stopping  times  localizing 
all  M{i,j)  simultaneously,  so  U{l3o)  is  locally  square  integrable  with  predictable  variation 


(UiM),  =  E  E  r 

iei 

=  E  /  Vs iPo,  i)d As (i). 

Jo 


(6) 


Now  we  rescale  time.  For  each  positive  n  define  a  discretized  time-scaled  version  of  the  score 
that  is  right-continuous  with  limits  from  the  left.  The  process  is  defined  for  times  a  in  [0, 1]; 
between  times  in  [^,  it  takes  the  value  Ut^;  i.e., 

=  (7) 

Part  (a):  Lemma  B.l  shows  that  U^\(3q)  is  a  square-integrable  martingale  adapted  to 
,  the  cr-algebra  of  events  prior  to  tyan]  ■  Since  it  only  depends  on  values  at  jump 
times,  the  quadratic  variation  of  U^'^\I5q)  at  time  a  is  equal  to  the  quadratic  variation  of  U{(3o) 
at  time  Therefore,  since  quadratic  and  predictable  variation  have  the  same  limit  when 

it  exists  (Rebolledo,  1980,  Prop.  1),  assumption  A2  implies  that  (/3q)),^  T,a{Po). 

Lemma  B.2  in  turn  verifies  that  satishes  a  Lindeberg  condition  necessary  for  the 

application  of  Rebolledo’s  (1980)  Martingale  Central  Limit  Theorem.  Thus  the  process  converges 
in  distribution  to  a  Gaussian  process  with  covariance  function  Eq(/3o)  as  claimed. 

Part  (b):  Recalling  Mt{i)  =  Nt{i)  —  At{i),  combine  (5b)  and  (6)  to  obtain  the  relation 

E  K(/3o,*)dM,(t)  =/t^_j(/3o) (/3o))„.  (8) 

i  "'0 

When  a  €  [0, 1],  a  repeated  application  of  the  triangle  inequality  to 

I  (/3n)  -  (^o)  -  (/?o))  -  Sa(/3o)| 

using  the  relation  of  (8)  yields 


i/*L_j(/3„)-E„(/3o) 


-k 


-k 


E/  {Vs0„,i) -Vsil3o,i)}dNs{i) 

,  Jo 
n  ^  Jo 

I  r  LoH 

n  ^  Jo 


We  show  that  all  three  terms  converge  to  zero  in  probability.  The  first  term  above  is  uniformly 
bounded  by  sup„/_j  \\Vt^,  (/3„,i)  — Vt^,(/3o,*)||,  which  converges  to  zero  since  /3„  — >■  /3o  by  hypothesis 
of  the  theorem  and  (•,*)}  is  an  equicontinuous  family  by  assumption  A4.  Lemma  B.3 
proves,  as  a  consequence  of  assumption  A3  and  Lenglart’s  (1977)  Inequality,  that  the  second 
term  converges  to  zero  uniformly  in  a.  The  third  term  converges  to  zero  by  assumption  A2, 
thereby  concluding  the  proof. 
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4.  Multicast  interactions 


In  Sections  2  and  3,  we  have  assumed  that  each  interaction  involves  a  single  sender  and  a  single 
receiver.  The  model  and  corresponding  asymptotic  theory  are  sufficient  to  cover  strictly  pairwise 
directed  interactions  (e.g.,  phone  calls),  but  they  do  not  describe  interactions  that  can  involve 
multiple  receivers  (e.g.,  e-mail  messages).  We  call  an  interaction  involving  a  single  sender  and 
possibly  multiple  receivers  a  multicast  interaction. 

In  practice,  multicast  interactions  are  typically  treated  in  an  ad-hoc  manner  via  duplication — 
for  example,  interaction  i  — >■  {ji,  gets  recorded  as  three  separate  pairwise  interactions 

f  — )■  ji ,  z  — )■  j2 ,  and  z  — )■  j3 — giving  rise  to  approximate  likelihood  and  inference.  In  this  section 
we  explore  the  implications  of  using  this  approximate  likelihood  in  the  multicast  setting.  In 
particular  we  show  it  to  be  closely  related  to  an  extension  of  our  model  for  directed  pairwise 
interactions,  and  that  the  bias  introduced  by  such  an  approximation  can  be  quantified  and  in 
certain  cases  corrected. 

To  this  end,  we  introduce  an  extension  of  the  model  to  the  multicast  setting.  Let  X, 
xt{i,j),  and  /3o  be  as  in  Section  2.  For  each  sender  z  and  positive  integer  L,  let  Xt{i]L) 
be  a  non-negative  predictable  process  called  the  baseline  L-receiver  intensity  of  sender  z.  Let 
(ti,  zi,  Ji), . . . ,  (tn,  in,  Jn)  be  the  sequence  of  observed  multicast  interactions,  with  tuple  (t,  z,  J) 
indicating  that  at  time  t,  sender  i  interacted  with  receiver  set  J.  For  a  set  J,  let  |  J|  denote  its 
cardinality. 

Consider  a  model  for  multicast  interactions  where  the  rate  of  interaction  between  sender  i 
and  receiver  set  J  is 

At(z,  J)  =  At(z;|J|)  •exp|^/3j'xt(z,j)|  •  l{j  G  Jt{i)}-  (9) 

3^J  jeJ 

The  log  partial  likelihood  at  time  t,  evaluated  at  /?,  is 

logPTt(/3)  =  ^  I  ^ /3'^a;tjz,„,j)  -  log  [  ^  exp  {  j)}]  j-  (10) 

\J\  =  \Jn,\ 

Suppose  instead  of  using  the  multicast  model,  we  use  duplication  to  get  pairwise  interactions 
from  the  original  multicast  data.  If  we  use  the  model  of  (I)  for  the  pairwise  data  and  ignore  ties 
in  the  interaction  times,  we  obtain  an  approximate  partial  likelihood: 

logPLt(^)  =  ^  I  ^ /^'^xtjz^,  j)  -  km|  log  [  ^  exp{/3'^a:tjz„,j)}]|.  (11) 

We  claim  log  PLt{P)  approximates  log  PLt{P).  Heuristically,  replacing  the  sum  over  all  sets 
of  size  \Jm\  in  (10)  with  a  sum  over  all  multisets  of  size  \  Jm\  (i-O.,  allowing  duplicate  elements 
from  Jt^{im)),  observe 

log[  ^  exp{  ^/3'^xtjz,„,j)}]  «  log  [(  ^  exp{/3'^a;tjz™,j)})''^™'] 

jGJ 

\J\  =  \Jrr,\ 

=  |J„|log[  ^  exp{/3'^a:tjzm,j)}]. 

In  this  sense,  logFTt(/3)  ~  logPL((/3).  Section  4.1  makes  this  statement  more  precise,  and 
Section  4.2  analyzes  the  bias  introduced  by  maximizing  logPLt(/3)  in  lieu  of  log  PLt{f3). 
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4. 1.  Approximation  error 

Define  the  receiver  set  growth  sequence 

^  IjlJml  >  1} 


(12) 


Thi^equence  plays  a  critical  role  in  bounding  the  error  introduced  by  replacing  log  PL  with 
log  PL.  Note  that  when  \Jt^  {im)  \  is  constant  G„  has  linear  growth,  but  when  |  increases, 

Gn  often  has  sublinear  growth.  For  example,  the  Cauchy-Schwartz  inequality  gives 


Gn  ^  y/n  ■ 


E 


SO  if  |j7t„(tm)|  =  ui{\/m),  then  Gn  =  0{y/n).  Theorem  4.1  bounds  the  approximation  error  in 
terms  of  G„. 


Theorem  4.1.  Let  (tm,  im,  Jm)  be  a  sequence  of  observations  from  a  multivariate  point  pro¬ 
cesses  with  intensity  as  given  in  (9).  Assume  that  sup^  ||a;t(i,j)||  and  sup^  \Jm\  are  bounded  in 
probability.  If  log  PL  and  log  FT  are  as  defined  in  {10-11),  and  Gn  is  as  defined  in  (12),  then 
for  P  in  a  neighborhood  of  /do, 


V[logFTt„(^)]-V[logFTt„(/3)] 


Op{Gn), 


and 


V2[logFTt„(/3)]-V2[logFTt„(/3)] 


Op{Gn). 


Proof.  When  J  C  Ji(i),  set  Xt{i,J)  =  J2j(£j^t{i,j)  and  Wt{l3,i,J)  =  exp{/?'^W(b  </)}.  As 
a  slight  abuse  of  notation,  when  j  is  an  element  of  Jt{i),  take  ^^Wt{l3,i,  j)’'  to  mean  wt{(i,i,  {j})- 
Define  weights 


Wt{l3,i-,L)  = 


Wt{ld,i;L)  = 


^  wt{/3,i,J), 

JQJtii), 

|,7|=L 

E  '^t{P,i,j)  , 


and  note  that  the  approximation  error  in  log  PLt{j3)  comes  from  replacing  W  with  W. 
The  gradients  of  the  weights  are 


Et{p,i-L)=X[logWt{p,i-L)] 


1 

Wt{P,i;L) 


^  Wt{i,J)Xt{i,J), 

JQJtii). 

\J\=L 


Et{/3,i;L)  =X  [log  Wt{/3,p,L)] 


The  second  is  the  expectation  of  ^t(*>  J/)  when  ji,...,jL  are  drawn  independently  and 

identically  from  fLt{i)  with  weights  Wt{l3,i,-);  the  first  is  the  same  expectation,  conditional  on 
the  event  that  ji, . . .  jjp  are  all  unique.  Let  Pt,/3y;L  and  Pt,/3y;z,  denote  the  two  probability  laws 
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for  and  let  and  ^t,i3,i;L  denote  expectations  with  respect  to  them,  so  that 

Et{j3,i\L)  =  and^t(/3,i;L)  ji)]. 

The  bound  on  V[log  PLt^  (/3)]— V[log  -PTt„  (/3)]  derives  from  a  bound  on  -E* (/?,  z;  L)—Et{P,  i;L). 
Write 

L  L 

Et{P,i;L)  -  Et{l3,i;L)  =  ^  Xt(z,  j/)  -  j/) 


1^1 


We  define  probability  law  ^  and  associated  random  variables  ji,  ■  ■  ■  ,jL  and  Ji, . . . ,  Jl,  such 
that  marginally  ji,...,jL  are  distributed  according  to  Pt,/3,i;L  and  are  distributed 

according  to  Pt,,g,i;L,  but  the  variables  are  coupled  to  have  nontrivial  chance  of  agreeing.  Then, 


Et{l3,i;L)  -  EtW,i;L)  =  a;t(z,  j;)  -  ^  Xt(z,  j/) 

1=1  1=1 

sup  \\xt{i,j)\\ 


<  2L 


The  coupling  is  as  follows: 

(a)  Draw  (ji, . . . ,  Jl)  according  to  Pt,/3,i;L- 

(b)  If  are  all  unique,  set  otherwise  draw 

independently  according  to  Pt,/3,i;L. 

With  K  =  supjgj^j(q  ||xt(z,  j)||.  Lemma  C.l  shows 


■  t,l3,i;L 


■  ,3l)  +  (ji,  ■  •  ■  Jl)  r  < 


exp{4iL||/3||} 

|Jz(z)| 


The  resulting  bound  on  || V[log PLt(/3)]  —  V[log  PTt(,5)]||  now  follows  by  expressing 
V[log:PLt(/3)]  -  V[logPLt(/3)]  =  ^  £;tJ/3,z„;|J™|)-.EtJ/3,z„;|J™|). 

t^<t 


Using  Et{P,i-L)-Et{fi,i-L)  <  K  (£  -  1)  ,  we  get 


V[logPLt(/3)]  -V[logPLt(/3)]||  <i^exp{4X||/?||}.  ^ 


tm^t 


We  get  the  final  bound  for  the  gradients  by  replacing  the  numerators  of  the  summands  with 

®aPm  I  I  ■ 

Using  the  same  methods.  Lemma  C.2  derives  the  bound  on  the  difference  in  Hessians. 


4.2.  Bias  correction  from  the  approximate  partial  likelihood 

When  we  use  ad-hoc  duplication,  we  are  performing  approximate  inference  under  the  multicast 
model  of  (9).  In  practice,  even  if  we  explicitly  want  to  use  the  multicast  model,  computing  the 
partial  likelihood  of  (10)  involves  an  intractable  combinatorial  sum,  so  we  may  resort  to  using 
the  approximation  instead.  Maximizing  log  PLt{-)  instead  of  logPT((-)  introduces  bias  in  the 
estimate  of  /3o-  Theorem  4.2  (proved  in  Appendix  C)  bounds  the  bias. 
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Theorem  4.2.  Under  the  setup  of  Theorem  4-1)  let  /3„  maximize  log  PLt^{-)  and  let  j3n 
maximize  log  PLt^{-)-  Suppose  for  all  n  that  the  Hessian  iV^[logFLt„(-)]  is  uniformly  locally 
Lipschitz  and  with  smallest  eigenvalue  bounded  away  from  zero  in  a  neighborhood  of  /3„.  If 

p 

Gntri  — >■  0,  then 

||/3„-/3„||  =Op(G„/n). 

It  remains  to  be  shown  that  /3„  is  a  consistent  estimator  of  /3o-  This  follows  directly  from 
the  theory  in  Section  3,  since  the  multicast  case  can  be  considered  as  a  special  case  of  the 
single  receiver  case:  Consider  the  product  X  x  N+  as  the  sender  set,  and  the  power  set  Vi  J) 
as  the  receiver  set.  For  sender  {i,L),  the  process  X{i;L)  is  then  the  baseline  send  intensity,  and 
{J  C  Jt{i)  :  |J|  =  L}  is  the  receiver  set;  for  sender-receiver  pair  (^{i,L),j),  vector 
is  the  covariate  vector.  Consistency  of  the  MPLE  now  follows  from  Theorem  3.2. 

Suppose  the  true  MPLE,  /3„,  is  a  -yn-consistent  estimate  of  /3o-  (Theorem  3.2  gives  sufficient 
conditions.)  Theorem  4.2  says  that  if  \Jt„^(im)\  grows  fast  enough  to  make  G„  smaller  than 
Gp{y/n),  then  the  approximate  MPLE,  f3n,  is  also  -^/n-consistent.  Moreover,  if  ^/n{j3n  —  Po)  is 
asymptotically  Gaussian,  then  \/n{l3n  —  Pn)  is  asymptotically  Gaussian  with  the  same  covariance 
matrix  but  possibly  a  different  mean.  Under  enough  regularity,  —  ^  [V^  log  {Pn)]  consistently 
estimates  the  limiting  covariance  of  \/ri(Pn  —  Po)-  To  get  the  mean,  we  use  a  parametric  bootstrap 
as  follows. 

Assume  that  the  conditions  of  Theorem  4.2  hold.  The  residual  /?„  —  Pq  depends  continu¬ 
ously  on  Pq  and  the  covariate  process  Xt{i,j)-  Since  /3„  is  a  consistent  estimator  of  /?o,  we  can 
estimate  the  bias  in  /3„  via  a  parametric  bootstrap.  We  generate  a  bootstrap  replicate  dataset 
{{tmUrm  Jm^)}  by  drawing  Jm\  a  random  subset  of  Jt^{im)  with  size  \Jm\  whose  elements  are 
drawn  proportional  to  Wt^iPnPrm  ■)■  We  then  get  a  bootstrap  approximate  MPLE,  Pn  \  by 

(t) 

maximizing  ,  where 


Note  that  is  determined  from  the  original  dataset,  not  the  bootstrap  dataset.  For  each 

bootstrap  replicate,  we  get  a  residual  PX  —  Pn-  With  R  bootstrap  replicates,  we  estimate  the 
bias  by 

bias=  -  Pn- 

r—1 

We  adjust  for  estimator  bias  by  replacing  /3„  with  /3„  —  bias. 

5.  Fitting  the  modei  to  a  corporate  e-maii  network 

Recall  from  Section  1  that,  given  a  set  of  interaction  data  triples  {t,  i,j),  a  primary  modeling  goal 
lies  in  determining  which  characteristics  and  behaviors  of  the  senders  and  receivers  are  predictive 
of  interaction.  The  modeling  and  inference  framework  introduced  above  enables  us  to  directly 
address  these  concerns,  as  we  now  demonstrate  through  the  analysis  corporate  e-mail  network 
consisting  of  a  large  subset  of  the  e-mail  messages  sent  within  the  Enron  corporation  between 
1998  and  2002.  These  e-mail  interaction  data  give  rise  to  the  following  questions: 

Homophily  To  what  extent  are  traits  shared  between  individuals  (gender,  department,  or  se¬ 
niority)  predictive  of  interaction  behaviors? 
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Fig.  1.  Message  recipient  counts.  Each  of  the  21,635  points  corresponds  to  a  message,  with  y-vaiue 
showing  the  base-2  iogarithm  of  its  number  of  recipients  (ranging  from  1-57). 


Network  Effects  To  what  extent  are  dyadic  or  even  triadic  network  effects,  as  characterized 
by  past  interaction  behaviors,  relevant  to  predicting  future  interaction  behaviors? 

We  undertake  our  analysis  using  the  multicast  proportional  intensity  modeling  framework 
developed  in  Sections  2  and  3  above,  employing  both  static  covariates  reflecting  actor  traits, 
as  well  as  dynamic  covariates  capturing  network  effects.  The  bootstrap  technique  introduced 
in  Section  4  for  multicast  interactions  is  then  used  to  reduce  bias  in  the  estimated  effects,  as 
well  as  to  demonstrate  that  our  asymptotic  approximations  are  reasonable  in  this  data  modeling 
regime.  We  conclude  this  section  with  a  discussion  of  the  goodness  of  fit  of  our  model  in  this 
setting,  before  turning  our  attention  in  Section  6  to  an  evaluation  of  the  strength  of  homophily 
and  network  effects  in  explaining  these  data. 

5.1.  Data  and  methods 

Our  example  analysis  uses  publicly  available  data  from  the  Enron  e-mail  corpus  (Cohen,  2009), 
a  large  subset  of  the  e-mail  messages  sent  within  the  Enron  corporation  between  1998  and  2002, 
and  made  public  as  the  result  of  a  subpoena  by  the  U.S.  Federal  Energy  Regulatory  Commission 
during  an  investigation  into  fraudulent  accounting  practices.  We  analyze  the  dataset  compiled  by 
Zhou  et  al.  (2007),  comprising  21,635  messages  sent  among  156  employees  between  November  13, 
1998  and  June  21,  2002,  along  with  the  genders,  seniorities,  and  departments  of  these  employees. 

Approximately  30%  of  these  messages  have  more  than  one  recipient  across  their  To,  CC,  and 
BCG  fields,  with  a  few  messages  having  more  than  fifty  recipients  (see  Fig.  1).  In  the  subsequent 
analysis,  we  exclude  messages  with  more  than  5  recipients — a  subjectively-chosen  cutoff  that 
avoids  e-mails  sent  en  masse  to  large  groups. 

We  model  these  data  using  the  multicast  proportional  intensity  model  of  Section  4,  with 
T  =  J  =  {1,  2, . . . ,  156}  and  Jt{i)  =  21  \  {ij,  and  with  static  and  dynamic  covariates  described 
in  tf^  next  section.  We  fit  the  model  by  first  maximizing  the  approximate  log  partial  likelihood 
log  PLt{5)  of  (11),  and  then  employing  a  parametric  bootstrap  to  estimate  and  correct  the 
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Variate 

Gharacteristic  of  actor  i 

Gount 

L{i) 

member  of  the  Legal  department 

25 

m 

member  of  the  Trading  department 

60 

j{i) 

seniority  is  Junior 

82 

F{i) 

gender  is  Female 

43 

LJ{i) 

Lit)  ■  J{i) 

12 

TJ(i) 

Til)  ■  J(^) 

24 

LF{{) 

m  •  Fit) 

13 

TF{{) 

Til)  •  Fii) 

10 

JF{t) 

.Jii)  •  Fit) 

27 

Fig.  2.  Actor-specific  traits,  with  counts  of  how  many  actors  share  each  trait 


resultant  bias  in  parameter  estimates.  We  calculate  standard  errors  using  the  corresponding 
asymptotic  theory.  In  the  setting  of  this  example,  the  interaction  count  is  high,  so  the  asymptotic 
framework  developed  in  Sections  3  and  4  is  natural.  The  main  violation  of  assumptions  A1-A4 
is  that  our  covariates  (described  in  Section  5.2)  may  in  principle  be  unbounded;  nevertheless, 
bootstrap  calculations  (described  in  Section  5.3)  show  that  the  asymptotic  approximations  we 
employ  remain  reasonable  in  this  regime. 

We  wrote  custom  software  in  the  C  programming  language  to  fit  the  model  using  Newton’s 
method.  Our  implementation  exploits  structure  in  the  covariates  to  make  the  computational 
complexity  of  the  fitting  procedure  roughly  linear  in  the  number  of  messages  and  the  number  of 
actors.  Appendix  A  describes  the  fitting  procedure  in  detail.  It  took  approximately  70  minutes 
to  fit  the  full  model  using  a  standard  desktop  computer  with  a  2.4  GHz  processor  and  4GB  of 
RAM.  Each  bootstrap  replicate  took  approximately  20  minutes  to  generate  and  ht,  using  the 
original  estimate  as  a  starting  point  for  the  fitting  algorithm.  Most  of  the  complexity  in  the 
fitting  procedure  is  due  to  the  inclusion  of  triadic  covariates  as  described  below;  including  only 
dyadic  covariates  reduces  the  fitting  time  to  less  than  10  seconds. 


5.2.  Covariates 

The  goal  of  our  investigation  is  to  assess  the  predictive  ability  of  actor  traits  and  network  effects. 
To  this  end,  we  choose  covariates  that  encode  these  traits  and  effects.  Each  covariate  is  encoded 
as  a  component  of  the  time- varying  dyad-dependent  vector  Xt{i,j),  which  is  linked  to  the  rate  of 
interaction  between  sender  i  and  receiver  j  via  the  multicast  proportional  intensity  model  of  (1). 


5.2.1.  Static  covariates  to  measure  homophily  and  group-level  effects 

Gonsider  first  those  actor  traits  that  do  not  vary  with  time:  the  actors’  genders,  departments, 
and  seniorities.  We  encode  the  traits  of  actor  i  and  their  second-order  interactions  using  9 
actor-dependent  binary  (O/I)  variables,  as  described  in  Fig.  2. 

We  encode  all  90  identifiable  second-order  interactions  between  the  traits  of  sender  i  and 
receiver  j  as  components  of  a;t(i,j).  We  do  this  by  using  variates  of  the  form  y(j)  and  X (i) -Y (j) , 
where  X  and  Y  are  chosen  from  the  list  of  9  actor-dependent  variates  (L,  T,  J,  F,  LJ,  etc.). 
We  cannot  identify  the  coefficients  for  covariates  of  the  form  X{i)-1-,  if  a  component  of  is 

the  same  for  all  values  of  j,  then  the  corresponding  component  of  j3  will  not  be  identifiable  since 
the  product  of  the  two  can  be  absorbed  into  Xt{i)  without  changing  the  likelihood.  Govariates 
of  the  form  I  •  Y{j)  pose  no  identifiability  problems. 
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send  i  - ►  j  i  has  sent  j  a  message  in  the  past 

receive  i  ■* -  j  i  has  received  a  message  from  j  in  the  past 

2-send  i  - ►  h  - ►  j  there  exists  an  actor  h  such  that  i  has  sent  h  a  mes¬ 

sage  and  h  has  sent  j  a  message  in  the  past 


2-receive  i  -< -  h  -  j  there  exists  an  actor  h  such  that  i  has  received  a 

message  from  h,  and  h  has  received  a  message  from  j 


sibling 


cosibling 


h 


i  j 


h 


i  j 


there  exists  an  actor  h  such  that  h  has  sent  i  and  j 
messages  in  the  past 


there  exists  an  actor  h  such  that  h  has  received  mes¬ 
sages  from  i  and  j 


Fig.  3.  Dynamic  covariates  to  measure  network  effects 


We  measure  homophily  by  way  of  the  estimated  coefficients  for  covariates  of  the  form  X(i)  ■ 
X{j).  For  example,  if  the  sum  of  the  coefficients  of  1  •  J(j)  and  J{i)  ■  J{j)  is  large  and  positive, 
this  tells  us  that  Junior  employees  exhibit  homophily  in  their  choice  of  message  recipients. 


5.2.2.  Dynamic  covariates  to  measure  network  effects 

Static  effects  are  useful  for  determining  which  traits  are  predictive  of  the  relative  rate  of  interac¬ 
tion  between  sender  i  and  receiver  j,  but  they  do  not  shed  light  on  network  effects.  Therefore, 
we  are  also  interested  in  the  predictive  relevance  of  the  dynamic  network  behaviors  described  in 
Fig.  3.  The  first  two  behaviors  (send  and  receive)  are  “dyadic,”  involving  exactly  two  actors, 
while  the  last  four  (2-send,  2-receive,  sibling,  and  cosibling)  are  “triadic,”  involving  exactly 
three  actors. 

To  measure  first-order  dependence  of  message  exchange  behavior  on  these  network  effects,  we 
introduce  binary  indicators  for  all  6  effects  as  components  of  These  indicators  depend 

on  the  sender  i,  the  receiver,  j,  and  the  history  of  the  process  at  the  current  time  t.  By  the 
shorthand  notation  Ijsend},  we  denote  the  indicator  variable  depending  on  sender  i,  receiver 
j,  and  the  current  time,  t,  which  indicates  if  i  has  sent  j  a  message  before  time  t,  with  the 
remaining  notations  (Ijreceive},  l{2-receive},  etc.)  defined  similarly. 

To  measure  higher-order  time  dependence,  we  introduce  additional  covariates  of  the  following 
form.  We  partition  the  interval  [— oo,t)  into  K  =  7  sub-intervals: 

[— oo,  t)  =  [t  —  Ak,  t  —  Ak-i)  U  [t  —  Ak-i,!  —  Ak-2)  U  •  •  •  U  [t  —  Ai,  t  —  Aq) 

where  oo  =  Ak  >  Ak-i  >  •  •  •  >  Ai  >  Aq  =  0  and  “t  —  oo”  is  defined  to  be  — oo.  Specifically, 
we  set  Afc  =  (7.5  minutes)  x  4^  for  k  =  1, . . . ,  K  —  1  so  that  for  k  in  this  range  Ak  takes  the 
values  30  minutes,  2  hours,  8  hours,  32  hours,  5.33  days,  and  21.33  days. 
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Define  the  half-open  interval  =  [t  —  Ak,t  —  Ak-i).  For  k  =  1, . . . ,  K  we  define  the 
dyadic  effects 

sendf^(i,  j)  =  #{i  ^  j  in  4'"^}, 

receive44*.j)  =  #{j  ^  *  in 

for  sender  i,  such  that  these  covariates  measure  the  number  of  messages  sent  to,  and  respectively 
received  by,  receiver  j  in  time  interval  • 

The  dyadic  effects  have  been  defined  in  the  manner  above  to  enable  easy  interpretation  of  the 
corresponding  coefficients.  To  illustrate  this,  for  k  =  1, K ,  suppose  that  /3fc  is  the  coefficient 
corresponding  to  send^^bj)-  If  observe  the  message  i  — >■  j  at  time  t,  then  for  future  time 
t'  in  the  interval  {t,t  +  Ai],  the  rate  Xt'ihj)  will  be  multiplied  be  the  factor  for  t'  in  the 
interval  {t  +  Ai,  t  +  A2],  the  rate  will  be  multiplied  by  ;  this  continues  similarly,  with  the  rate 
being  multiplied  by  6^*=  whenever  t'  €  (f-l- Afc_i,f-|- A^];  equivalently,  when  Afc_i  <  t'  —  t  <  Ak- 
Thus,  the  coefficients  (3i, ,  fix  measure  the  effect  of  a  “send  event”  and  how  this  effect  decays 
over  time.  We  expect  that  Pk  will  decrease  as  k  increases,  but  we  do  not  enforce  this  constraint 
on  the  estimation  procedure. 

The  triadic  effects  involve  pairs  of  messages.  For  k  =  1, . . .  ,K  and  I  =  1, . . . ,  K  we  define  the 
triadic  effects 

2-send4’'4b  j)  =  ^  #{f  ^  h  in  ^  J  in 

h^i,j 

2-receive4’'4b  j)  =  X]  ^  ^  h  in  4*4. 

h^i,j 

sibling4’'4b  j)  =  X]  ^  ^  J 

h^i,j 

cosibling4’'4b  j)  =  #{i  ^  h  in  4*^4  '  #{i  ^  h  in  4*4- 

h^i,j 

For  sender  i  and  receiver  j,  the  covariate  2-send4’*4b  j)  counts  the  pairs  of  messages  such  that 

(k) 

for  some  h  distinct  from  i  and  j,  message  i  ^  h  occurred  in  interval  and  message  h  ^  j 
occurred  in  interval  4*^.  the  other  covariates  behave  similarly. 

As  with  the  dyadic  effects,  the  triadic  effects  are  designed  so  that  their  coefficients  have 
a  straightforward  interpretation.  However,  since  triadic  effects  involve  pairs  of  messages,  the 
interpretation  is  a  bit  more  involved.  We  illustrate  with  the  2-send4’*4b  j)  covariate  having 
coefficient  Pk,i  for  k  =  1, ...,  K  and  I  =  1, . . . ,  AT.  Take  i  and  j  to  be  two  actors.  Suppose  at 
time  t  we  observe  the  message  h  ^  j.  At  this  point,  we  look  through  the  history  of  the  process 
for  all  messages  of  the  form  i  —>■  h;  when  paired  with  the  original  h  —>■  j  message,  each  of  these 
defines  a  “2-send  event.”  The  other  2-send  events  are  defined  as  follows:  if  at  time  s  we  observe 
the  message  i  ^  then  we  enumerate  all  observed  messages  h  — >■  j  in  the  history  of  the  process; 
when  each  of  these  is  paired  with  the  original  i  — )■  /i  event  it  constitutes  a  2-send  event.  A  pair 
(s,  t)  can  be  associated  with  each  2-send  event,  where  s  is  the  time  of  the  i  ^  h  message  and  t  is 
the  time  of  the  h  —>■  j  message.  At  time  t'  after  s  and  t,  the  existence  of  the  2-send  event  causes 
the  sending  rate  Xt'{i,j)  to  be  multiplied  by  the  factor  where  t'  G  {s  +  Ak-i,s  -b  A^]  and 
t'  G  {t  +  Ai-i,t  -b  A/].  We  expect  Pk,i  to  decrease  as  k  and  I  increase,  though  again  we  do  not 
enforce  this  constraint  in  the  htting  procedure. 

As  previously  noted.  Butts  (2008)  used  a  variant  of  the  proportional  intensity  model  to 
capture  interaction  behavior  in  social  settings.  As  such,  a  correspondence  can  be  drawn  between 
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certain  of  the  covariates  in  Butts  (2008)  and  those  outlined  above.  If  we  set  ilT  =  1,  then  the  sendj 
covariate  is  equivalent  to  an  unnormalized  version  of  Butts’  persistence  covariate,  and  the  sum 
(sendt +receivei)  becomes  an  unnormalized  version  of  Butts’  preferential  attachment  covariate. 
For  the  triadic  effects,  Butts’  OTP,  ITP,  ISP,  and  OSP  covariates  are  analogous  to  the  2-send, 
2-receive,  sibling,  and  cosibling  covariates,  although  the  exact  definitions  differ  slightly.  (For 
example,  OTPt(i,  j)  is  defined  as  h  in  (— oo,t)},  #{/i  — )■  j  in  (— oo,t)}].)  The 

versions  of  these  covariates  that  we  have  introduced  above,  however,  are  designed  to  enable  a  more 
precise  quantihcation  of  the  time-dependence  of  network  effects,  as  well  as  a  more  straightforward 
interpretation  of  the  corresponding  coefficients. 


5.3.  Bootstrap  bias  correction 

Given  the  model  specification,  data,  and  covariates  outlined  above,  we  can  estimate  the  parameter 
vector  /3q  under  the  approximate  log  partial  likelihood  of  (II).  Recall  that  the  results  of  Section  4 
bound  the  bias  resulting  from  this  approximate  MPLE  procedure  as  a  function  of  the  growth  rate 
of  the  recipient  set  over  time.  Here,  treating  the  set  of  156  Enron  employees  as  constant, 
the  resultant  bias  is  of  order  l/\J\ — and,  since  \J\  =  156  is  on  the  order  of  the  square  root 
of  the  number  21,365  of  messages  in  the  dataset,  we  can  correct  this  bias  using  the  parametric 
bootstrap  outlined  at  the  end  of  Section  4. 

Fig.  4  summarizes  the  corresponding  bootstrap  residuals  (from  300  replicates)  for  each  com¬ 
ponent  of  the  estimated  parameter  vector  /?□;  we  can  see  from  this  figure  that  treating  messages 
with  multiple  recipients  as  multiple  single-recipient  messages  introduces  bias  on  the  order  of  the 
standard  error  for  most  of  the  coefficients.  There  is  a  pronounced  negative  bias  in  coefficient 
estimates  for  the  dyadic  effects,  which  is  representative  of  a  more  general  phenomenon.  Sparsity 
in  the  components  of  Xt{i,j)  (when  considered  as  a  function  of  j),  when  combined  with  high 
values  of  the  corresponding  entries  /?,  leads  to  negative  bias  in  the  coefficient  estimates  when 
there  are  messages  with  multiple  recipients.  The  approximation  in  (10)  is  worst  when  for  some 
j*,  weight  far  exceeds  all  other  values  of  j),  so  that  j*)  ~ 

when  \Jm\  is  large,  the  maximum  of  PL  will  avoid  this  situation  by  shrinking  j3  where  Xt^(irmj) 
is  sparse.  The  dyadic  covariates  are  particularly  sparse,  so  the  estimates  for  their  coefficients  are 
particularly  vulnerable  to  this  bias. 

Besides  correcting  for  bias,  the  bootstrap  simulations  give  us  confidence  that  the  asymptotic 
approximations  are  reasonable.  The  simulated  standard  errors  are  very  close  to  those  predicted  by 
the  theory,  despite  the  norm  ||a;i(I,  j)||2  being  potentially  unbounded,  contrary  to  the  assumptions 
of  Theorem  3.1. 


5.4.  Goodness  of  fit 

Figure  5  details  an  ad-hoc  analysis  of  deviance  for  the  fitted  model,  showing  how  the  approx¬ 
imate  deviance  (twice  the  approximate  log  partial  likelihood)  behaves  as  we  add  consecutive 
terms  to  the  model.  Group-level  (static)  effects  account  for  18%  of  the  residual  deviance  and 
network  effects  account  for  34%.  The  most  dramatic  decrease  in  the  residual  deviance  comes 
from  introducing  the  “Send”  terms  into  the  model;  with  only  8  degrees  of  freedom,  they  are  able 
to  account  for  31%  of  the  null  deviance.  The  full  model  accounts  for  54%  of  the  null  deviance. 

The  residual  deviance  for  the  full  model  is  approximately  4.8  times  the  residual  degrees 
of  freedom,  and  so  an  ad-hoc  adjustment  for  this  over-dispersion  is  to  multiply  the  calculated 
standard  errors  by  -v/TS  «  2.2. 

Note,  however,  that  the  residual  deviance  by  itself  is  not  adequate  as  a  goodness-of-Ht  mea¬ 
sure,  as  it  depends  only  on  the  estimated  coefficients  (see  Section  4.4.5  of  McGullagh  and  Nelder 
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Term 


Fig.  4.  Enron  bootstrap  residuals.  Summary  of  bootstrap  residuals  for  estimated  coefficients  using  the 
Enron  dataset,  normalized  by  estimated  standard  errors.  The  points  (orange)  show  the  means,  and  the 
error  bars  (purple)  show  one  standard  deviation.  Coefficients  are  grouped  by  model  term. 


Term 

Df 

Deviance 

Resid.  Df 

Resid.  Dev 

Null 

32261 

325412 

Static 

90 

59920 

32171 

265492 

Send 

8 

100758 

32163 

164734 

Receive 

8 

5396 

32155 

159338 

Sibling 

50 

3532 

32105 

155806 

2-Send 

50 

836 

32055 

154970 

Cosibling 

50 

1387 

32005 

153583 

2-Receive 

50 

299 

31955 

153284 

Fig.  5.  Ad-hoc  analysis  of  deviance  for  the  Enron  model.  Residual  deviance  is  defined  as  twice  the 
approximate  negative  log  partial  likelihood  from  (11).  The  “Static”  term  contains  the  group  level  effects, 
and  the  “Dynamic”  term  contains  the  reciprocation  effects. 
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(1989)  for  discussion  of  a  related  problem  for  logistic  regression  with  sparse  data).  To  shed  more 
light  on  how  well  the  model  fits  these  data,  we  use  a  normalized  version  of  the  martingale  residual 
from  Therneau  et  al.  (1990),  which  we  call  a  Pearson  residual.  Specifically,  given  jd,  we  define 


=  E 

trrr<t 


r. 

- ,  -  ,  l{»m 


0: 


which  is  the  expected  number  of  i  — )■  j  events  given  the  estimated  model,  with  J  Xt{i)  dt 
estimated  by  the  Breslow  (1974)  estimate  J  Wt{j3,i)~^^jdNij{t).  The  martingale  residual 
analogous  to  that  of  Therneau  et  al.  (1990)  is  then  defined  as  Nt{i,j)  —  we  nor¬ 

malize  this  quantity  by  an  estimate  of  its  standard  deviation  to  get  a  “Pearson”  residual: 

Fig.  6a  shows  a  plot  of  Noo{i,j)  versus  Noo{i,j)  for  two  different  models.  In  the  “static” 
model,  we  only  include  the  static  covariates,  while  in  the  full  (“static  and  dynamic”)  model,  we 
also  include  all  six  types  of  network  covariates.  The  fit  for  the  static  model  is  poor.  For  instance, 
it  repeatedly  predicts  up  to  200  i  — >■  j  events  where  we  only  observed  1  or  2;  likewise,  the  model 
predicts  1  or  fewer  events  where  we  observed  up  to  20.  For  the  full  model,  which  includes  the 
dynamic  covariates  to  account  for  network  effects,  the  fit  is  much  better,  with  the  relationship 
between  observed  and  expected  interaction  counts  being  roughly  linear. 

Fig.  6b  shows  the  Pearson  residuals.  For  the  full  model,  more  than  95%  are  less  than  1.15 
in  absolute  value,  and  the  maximum  absolute  residual  is  21.8.  In  contrast,  the  95%  quantile  for 
the  absolute  residuals  in  the  static  model  is  at  3.7,  and  the  maximum  absolute  residual  is  125.0. 
The  sum  of  squares  if  the  residuals  {X"^)  is  16498  in  the  full  model,  over  29  times  lower  than  that 
for  the  static  model  (480661).  We  don’t  know  what  a  “reasonable”  value  for  X'^  is;  an  ad-hoc 
degrees  of  freedom  calculation  suggests  that  for  the  full  number  this  should  be  roughly  equal  to 
23974  =  156  •  155  —  (90  -I-  2  •  8  -I-  2  •  50),  which  suggests  that  the  full  model  is  too  aggressive.  The 
bootstrap  simulations  confirm  this,  with  23974  being  5.2  standard  deviances  below  the  mean 
value  X"^  for  the  bootstrap  replicates. 

For  a  more  parsimonious  model,  we  might  drop  most  of  the  triadic  effects.  Indeed,  the 
model  which  only  uses  dyadic  effects  has  a  X'^  value  of  23785,  which  is  very  close  to  24074  = 
156  •  155  —  (90  -I-  2  •  8),  the  approximate  X^  degrees  of  freedom  for  that  model.  However,  at  this 
stage  we  desire  a  model  with  the  lowest  possible  bias,  and  also  wish  to  acquire  estimates  for  all 
of  the  network  effects. 


6.  Evaluating  the  strength  of  homophily  and  network  effects 

Given  the  model  fitting  procedure  and  results  described  above,  we  may  now  evaluate  the  strength 
of  homophily  and  network  effects  in  predicting  the  interaction  behavior  observed  in  our  data. 
Along  the  way,  we  compare  the  results  to  those  obtained  from  alternative  data  analysis  methods. 


6. 1.  Assessing  evidence  for  homophily  in  the  Enron  data 

The  analyses  of  Section  5  above  have  established  that  our  multicast  proportional  intensity  model 
with  chosen  covariates  is  reasonably  accurate  in  describing  message  recipient  selection,  condi¬ 
tional  on  the  sender  and  the  history  of  the  process.  Thus,  we  are  justified  in  using  the  estimated 
coefficients  from  the  model  to  assess  the  predictive  ability  of  the  corresponding  covariates. 

Our  first  task  is  to  gauge  the  predictive  strength  of  homophily.  To  this  end.  Fig.  7  shows  the 
estimated  group-level  coefficients  for  our  model.  Notably,  homophily  appears  present  for  almost 
all  main  effects  (Department,  Seniority,  and  Gender):  the  estimated  coefficients  of  L{j),  T{j), 
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Fig.  7.  Estimated  coefficients  and  standard  errors  for  group-ievei  covariates  of  the  form  X{i)-Y  (j),  where 
i  is  the  sender,  j  is  the  receiver,  and  X{i)  and  Y{j)  are  given  in  the  row  and  coiumn  headings;  dark 
coefficients  are  significant  (via  Waid  test)  at  ievei  10“^. 


J(j),  and  F{j)  are  all  negative,  while  the  sums  of  the  estimated  coefficients  of  the  following  pairs 
are  all  positive:  T{j)  and  T{i)  ■  T{j);  J{j)  and  J{i)  ■  J{j);  F{j)  and  F{i)  ■  F{j).  Taking  Gender 
as  an  example,  the  way  this  homophily  effect  manifests  is  as  follows:  if  i  is  sending  a  message 
at  time  t,  and  person  j  is  identical  to  person  j'  except  for  Gender,  then  i  is  more  likely  to  send 
to  the  similarly-gendered  individual.  The  relative  rate  is  exp(— 0.08  -I-  0.41)  «  1.4  if  i  is  Female, 
and  exp(0.08)  «  1.1  if  i  is  Male.  The  characterization  for  other  types  of  homophily  is  similar. 

Conspicuously,  the  only  trait  for  which  first-order  homophily  was  absent  was  for  the  Le¬ 
gal  group:  the  sum  of  the  estimated  coefficients  of  L{j)  and  L{i)  ■  L{j)  was  negative  (—0.30), 
indicating  that  employees  in  the  Legal  group  exhibited  anti-homophilic  behavior. 


6.2.  A  comparative  analysis  based  on  contingency  tables 

Were  we  interested  only  in  homophily,  we  might  be  tempted  to  forgo  the  proportional  intensity 
model  of  (1),  and  instead  perform  a  contingency  table  analysis.  However,  as  we  now  describe, 
such  an  analysis  leads  to  very  different  conclusions  about  the  predictive  strength  of  homophily 
in  our  data. 

For  example,  suppose  that  we  are  interested  in  testing  for  seniority-based  homophily.  Ignoring 
network  effects  and  other  dependency,  we  might  model  P{i  — )■  j  |  z},  the  probability  of  employee 
j  being  the  recipient  of  a  message  given  that  employee  i  is  the  sender,  by  way  of  a  multinomial 
logit  model: 

P{z  ^  j  I  z}  cx  exp{/3jJ(j)  +/3jjJ(z)J(j)}. 

In  this  setting,  Junior-Junior  homophily  would  manifest  in  a  positive  value  of  j3j  +  j3jj  and 
Senior-Senior  homophily  would  manifest  in  a  negative  value  of  f3j. 

Since  there  are  zzj  =  82  Junior  executives  and  ns  =  74  Senior  executives,  and  since  the 
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Fig.  8.  Estimated  coefficients  and  standard  errors  for  the  contingency  tabie-based  anaiysis  of  Section  6.2 


sender  and  receiver  of  a  message  are  distinct,  we  have  that 


j  I  =  1} 

P{t  ^  j  I  =0}, 


q{0j+0jj)JU) 

{nj  —  +  ns 

gPjJU) 

nje^-’  +  {ns  —  1) 


In  turn,  we  compute  the  corresponding  maximum  likelihood  coefficient  estimates  using  the  entries 
of  a  2  X  2  table  that  counts  the  number  of  messages  exchanged  between  each  group: 


Receiver 

Sender 

Junior 

Senior 

Junior 

7972 

5833 

Senior 

3977 

14479 

The  resultant  estimates  are  j3j  =  —1.4  and  $jj  =  1.6,  with  (Wald)  standard  errors  of  about  0.02 
for  each.  Indeed,  these  are  exactly  the  estimated  coefficients  we  would  obtain  if  we  were  to  use 
the  proportional  intensity  model  Xt{i,j)  =  At(i)  exp{/3j  J(j)  +  P j j  J {i)  J {j)} ,  a  result  that  holds 
more  generally  for  non-time-varying  covariates. 

One  problem  with  this  analysis  is  that  we  have  marginalized  over  the  other  covariates  (Gender 
and  Department),  potentially  introducing  a  Simpson’s  paradox.  This  issue  is  easily  rectified, 
however,  by  introducing  covariates  for  senders  and  receivers,  along  with  the  corresponding  second- 
order  covariate  interactions;  Fig.  8  shows  the  resulting  coefficient  estimates. 

The  far  more  important  problem  is  that  this  analysis  implicitly  assumes  each  message  to 
be  independent  and  identically  distributed,  conditional  on  the  sender  of  the  message.  This 
assumption  is  blatantly  false:  common  sense  tells  us  that  if  Junior  A  sends  a  message  to  Junior 
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Fig.  9.  Estimated  coefficients  for  network  indicator  effects 


B,  then  the  next  time  B  sends  a  message,  B  is  more  likely  to  choose  A  as  a  recipient.  Any 
homophily  effect  present  in  these  interaction  data  is  thus  likely  to  be  exaggerated  by  reciprocation 
and  other  network  effects.  Indeed,  comparing  the  contingency  table-based  estimates  in  Fig.  8 
with  the  estimates  from  the  proportional  intensity  model  with  network  effects  in  Fig.  7,  we  can 
see  that  the  coefficient  estimates  are  much  higher  when  we  don’t  adjust  for  network  effects. 
Thus  even  in  cases  where  network  effects  themselves  are  not  the  object  of  primary  interest,  it  is 
important  to  account  for  them  when  making  inferences  about  the  predictive  strength  of  other 
covariates. 


6.3.  Evaluating  the  importance  of  network  effects 

In  Section  6.1  we  established  that  homophily  was  predictive  of  sending  behavior,  even  after 
accounting  for  network  effects.  We  now  investigate  the  characteristics  of  these  network  effects 
and  establish  which  of  these  effects  are  of  greatest  importance. 

To  begin  our  analysis,  Fig.  9  shows  the  estimated  coefficients  for  the  network  indicator  effects, 
giving  a  crude  picture  of  the  predictive  importance  of  each  network  effect.  The  estimated  coef¬ 
ficients  are  all  positive,  indicating  that  network  effects  strengthen  the  ties  between  individuals. 
The  estimated  coefficient  for  I{send}  is  almost  four  times  larger  than  the  other  coefficients, 
agreeing  with  the  general  notion  that  one  is  most  likely  to  do  today  the  things  one  did  yesterday. 
The  next  tier  of  indicator  effects  comprises  Ijreceive},  Ijsibling},  and  l{2-send},  whose  es¬ 
timated  coefficients  are  each  around  0.80.  Two  triadic  effects,  l{2-receive}  and  Ijcosibling}, 
are  not  significantly  predictive  of  sending  behavior. 

The  estimated  coefficients  for  the  recency-dependent  covariates,  shown  in  Figs.  10  and  11, 
give  a  more  complete  picture  of  network  effects.  Firstly,  we  can  see  that  dyadic  effects  persist 
for  over  three  weeks  from  the  time  a  message  is  sent.  The  decay  of  the  estimated  coefficients 
is  roughly  exponential  in  the  time  elapsed,  corresponding  to  a  super-exponential  decay  in  the 
relative  sending  rate.  For  30  minutes  after  i  sends  a  message  to  j,  our  estimated  model  predicts 
that  the  rate  at  which  i  sends  to  j  will  be  multiplied  by  exp(l.lO)  «  3.01,  and  the  rate  at  which 
j  sends  to  i  will  be  multiplied  by  exp(1.73)  «  5.66;  then,  between  30  minutes  and  2  hours,  the 
rates  will  be  multiplied  by  exp(0.50)  «  1.65  and  exp(0.69)  «  1.99,  respectively;  this  proceeds 
similarly  until  after  21.3  days,  when  the  rates  will  be  multiplied  by  exp(0.003)  «  1.003  and 
exp(O.OOl)  «  1.001. 

Comparing  the  coefficients  for  send^^^  with  those  of  receive^^^  we  see  that  the  latter  are 
higher  for  k  <  2,  while  the  former  are  higher  for  k  >  2.  The  corresponding  intuition  is  that  if 
A  is  sending  a  message  up  to  two  hours  after  receiving  a  message  from  B,  then  A  is  likely  to 
respond  to  B,  but  after  that,  A  is  more  likely  to  send  to  an  individual  whom  A  e-mailed  at  the 
time  of  receiving  B’s  original  message  (provided  B  and  this  other  individual  are  identical  in  all 
other  respects) .  The  time  window  during  which  reciprocation  is  more  important  than  past  habit 
is  less  than  8  hours. 

From  Fig.  11,  we  can  see  that  the  triadic  effects  are  in  general  less  pronounced  and  are  much 
more  short-lived  than  the  dyadic  effects.  About  70%  of  the  estimated  coefficients  are  within 
3  standard  errors  of  0;  even  those  that  are  significantly  nonzero  mostly  lie  between  —0.05  and 
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Fig.  10.  Estimated  coefficients  for  dyadic  effects,  with  standard  errors 
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+0.05.  The  exceptions  are  the  coefficients  for  2-sendj^’^^  (0.27),  2-sendp’^^  (0.13),  2-sendp’^^ 
(—0.12),  2-receivep’^^  (0.06),  sibling^^’^^  (0.10).  and  sibling^’^^  (0.14).  We  may  interpret 
these  coefficients  as  follows: 

2-send  If  A  sent  B  a  message  in  the  last  2  hours  and  B  sent  C  a  message  in  the  last  30  minutes, 
then  A  will  send  messages  to  C  at  a  higher  rate;  however,  if  A  sent  B  a  message  between  2 
and  8  hours  ago  and  B  sent  C  a  message  in  the  last  30  minutes,  then  A  will  send  messages 
to  C  at  a  lower  rate. 

2-receive  If  C  sent  B  a  message  between  30  minutes  and  2  hours  ago,  and  if  B  sent  A  a  message 
between  1.3  days  and  5.3  days  ago,  then  A  will  send  messages  to  C  at  a  higher  rate. 

sibling  If  B  sent  A  and  C  messages  in  the  last  30  minutes,  then  A  and  C  are  more  likely  to 
send  messages  to  each  other;  if  B  sent  A  a  message  between  30  minutes  and  2  hours  ago, 
and  if  B  sent  C  a  message  between  2  hours  and  8  hours  ago,  then  A  is  more  likely  to  send 
C  a  message. 

Given  the  emphasis  on  transitivity  in  the  networks  literature,  it  may  at  first  seem  discon¬ 
certing  that  most  of  the  estimated  coefficients  for  the  time-dependent  triadic  effects  are  found 
to  be  insignificant  in  this  analysis.  However,  one  must  bear  in  mind  that,  except  for  messages 
sent  to  them  directly,  individuals  likely  have  no  knowledge  of  their  colleagues’  e-mail  activities, 
and  therefore  there  is  no  reason  why  this  activity  should  directly  affect  sending  behaviors.  Any 
predictive  power  the  triadic  effects  have,  then,  must  be  due  to  correlation  with  exogenous  factors. 
In  this  light,  it  is  not  surprising  that  the  triadic  effects  are  small  and  have  small  time  horizons. 


6.4.  Comparative  analyses  using  actor-oriented  and  exponential  random  graph  models 

The  results  of  Section  6.3  provide  a  detailed  view  of  the  ways  in  which  network  effects  can 
manifest  themselves  in  data.  To  further  bolster  our  conhdence  in  these  results,  we  compare  our 
corresponding  parameter  estimates  to  those  obtained  using  related  network  models. 

A  number  of  dynamic  network  models  exist  in  the  literature,  including  Hanneke,  Fu,  and 
Xing’s  (2010)  exponential  random  graph  model  with  time-varying  coefficients,  and  Kolar,  Song, 
Ahmed,  and  Xing’s  (2010)  time-varying  stochastic  block  model.  Alternative  approaches  explic¬ 
itly  based  on  point  processes,  but  excluding  network  effects  and  other  covariate  information, 
include  that  of  Malmgren  et  al.  (2009),  who  model  activity  at  the  level  of  the  individual  using 
a  hidden  Markov  model,  and  of  Heard  et  al.  (2010),  who  work  at  the  level  of  the  dyad,  assum¬ 
ing  a  piecewise-constant  interaction  rate.  The  closest  match  to  our  approach  is  given  by  the 
actor-oriented  model  of  Snijders  (2001,  2005),  which  we  now  detail  in  Section  6.4.1.  Then,  in 
Section  6.4.2,  we  provide  a  comparison  to  a  static  network  analysis  based  on  an  exponential 
random  graph  model. 


6.4.1.  Actor-oriented  model  analysis 

The  actor-oriented  model  is  designed  for  a  sequence  of  snapshots  Gi ,  G2 , . . . ,  of  network  activity, 
where  each  Gt  is  an  /  x  J  binary  matrix  representing  pairwise  connectivity  between  actors  at 
time  t.  The  model  is  best  suited  for  ties  that  persist  in  time,  not  instantaneous  events;  it  treats 
the  sequence  of  networks  as  a  first-order  Markov  chain,  with  the  distribution  of  Gt  determined  by 
Gt-i.  Actors  are  assumed  to  change  their  ties  between  times  t  —  1  and  t  to  maximize  a  stochastic 
utility  function  that  depends  on  characteristics  of  the  overall  network.  Essentially,  given  that 
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the  network  is  in  state  G,  and  that  actor  i  is  allowed  to  make  a  change,  he  will  change  his  link 
to  actor  j  according  to 


p{j\i,G)  oc  exp  {  ^  AT,(G(.  ^  ;))  +  Y,0'X{G  \  {.  ^  j})}. 

S=1  S  =  1 

where  Tg  and  Tg  are  network  statistics;  Pg  and  /3'  are  unknown  coefficients;  G{i  ^  j)  denotes  the 
network  obtained  either  by  adding  link  i  >  j  if  it  is  absent  or  removing  link  i  >  j  if  it  is  present; 
G  \  {z  — )■  j}  denotes  the  network  obtained  by  removing  z  — )■  j  if  it  is  present.  Additionally,  the 
rate  at  which  actor  z  changes  ties  between  observation  times  t—1  and  t  is  given  by  A(z),  specified 
via  another  parametric  model.  (See  Snijders  et  al.  (2010)  for  a  more  thorough  introduction.) 
The  change  probability  function  p(j|z,  G)  plays  a  similar  role  to  the  multiplier  exp{xt{i,  P} 
in  the  proportional  intensity  model,  and  the  change  rate  function  A(z)  plays  a  similar  role  to  the 
baseline  intensity  At(z). 

For  purposes  of  comparison,  we  specified  a  change  probability  model  p(j|z,  G)  with  network 
statistics  analogous  to  those  used  in  Section  5.2,  and  then  we  estimated  coefficient  sets  P  and  P' 
analogous  to  the  coefficients  in  the  proportional  intensity  model.  We  used  the  RSiena  package 
(Ripley  and  Snijders,  2011)  to  specify  and  fit  the  actor-oriented  model  after  binning  the  Enron 
e-mail  interaction  data  at  regular  intervals  to  obtain  network  snapshots  Gi,  G2,  and  G3.  Here, 
GtiPj)  =  1  if  message  i  ^  j  was  observed  in  period  t,  and  Gt(i,j)  =  0  otherwise;  periods 
1-3  correspond  to  consecutive  four-month  periods  in  the  year  2001.  The  subset  we  looked  at 
contains  60%  of  the  messages  in  the  Enron  corpus.  We  chose  this  particular  subset  and  temporal 
resolution  partially  for  computational  reasons,  but  also  to  make  the  network  change  statistics 
(Jaccard  coefficients)  within  the  range  recommended  by  RSiena  (near  0.3).  Approximately  2 
hours’  time  was  required  to  fit  the  model. 

We  included  the  following  terms,  chosen  to  mimic  the  covariates  detailed  in  Section  5.2: 

(a)  Outdegree/density  (out).  This  statistic  counts  the  number  of  outgoing  ties;  it  is  analogous 
to  our  At(z),  except  that  the  rate  is  the  same  for  each  sender. 

(b)  Group-level  edge  effects  (traits).  One  covariate  is  included  for  each  identifiable  first-order 
interaction  of  the  form  X{i)Y{j),  where  z  is  the  sender  and  j  is  the  receiver;  the  covariate 
counts  the  number  of  edges  i  ^  j  with  X{i)Y{j)  =  1.  These  effects  correspond  to  the 
group-level  effects  in  our  model.  We  would  have  liked  to  include  second-order  interactions 
as  well,  but  (for  computational  reasons)  were  unable  to  ht  the  model  with  higher-order 
effects. 

(c)  Outdegree/density  endowment  (outendow).  This  statistic  counts  the  number  of  deleted 
outgoing  ties;  it  corresponds  to  the  negative  of  the  send  term  in  our  model. 

(d)  Reciprocity  (recip).  This  statistic  counts  the  number  of  reciprocal  ties;  i.e.,  edge  sets  of 
the  form  {z  — >■  j,j  — )■  z}.  It  corresponds  to  the  receive  term  in  our  model. 

(e)  3-cycles  (Scycle).  This  statistic  counts  the  number  of  cyclic  triples;  i.e.,  edge  sets  of  the 
form  {/z  — )■  z,z  — )■  j,j  — )■  h}.  It  corresponds  to  the  2-receive  term  in  our  model. 

(f)  Transitive  triplets  (ttriple).  This  statistic  counts  the  number  of  transitive  triples;  i.e., 
edge  sets  of  the  form  {/z  — >  z,  z  — )■  /z,  h  — )•  j}.  It  corresponds  to  the  2-send,  sibling,  and 
cosibling  terms  in  our  model. 
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Fig.  12.  Estimated  effects  for  the  actor-oriented  modei  of  Section  6.4.1 


Note  that  after  binning  the  interaction  counts  to  form  network  snapshots  as  required  by  the 
actor-oriented  model,  it  is  impossible  to  separate  the  2-send,  sibling,  and  cosibling  effects. 
Further,  the  hrst-order  Markov  nature  of  the  model  restricts  our  ability  to  quantify  the  time 
decay  of  the  dynamic  network  effects.  Additionally,  as  we  were  unable  to  include  second-order 
interactions  between  the  group-level  effects  in  this  setting,  we  cannot  compare  the  estimated 
coefficients  for  the  hrst-order  interactions  with  those  obtained  from  the  proportional  intensity 
model.  However,  including  these  hrst-order  interactions  in  the  model  mediates  the  confounding 
role  of  homophily  in  estimating  the  network  effects,  our  primary  focus  for  this  comparison. 

Figure  12  shows  the  htted  coefficients  for  the  actor-oriented  model.  We  can  see  that  the 
estimated  network  effects  agree  qualitatively  with  those  in  Fig.  9,  as  outdegree/density  endow¬ 
ment  has  a  negative  coefficient  while  reciprocity  and  transitive  triplets  have  positive  coefficients. 
Further,  the  dyadic  coefficients  are  larger  than  the  triadic  coefficients.  A  discrepancy  between 
the  two  models  is  that  2-receive  had  a  negligible  effect  in  the  proportional  intensity  model, 
while  its  analogue  (Scycle)  had  a  small  negative  effect  in  the  actor-oriented  model.  One  possi¬ 
ble  explanation  for  this  discrepancy  is  that  treating  all  ties  as  binary  forces  a  negative  bias  in 
otherwise-unimportant  network  effects.  With  the  limitation  that  ties  are  binary,  when  actor  i 
tries  to  maximize  his  stochastic  utility,  he  is  forbidden  from  reinforcing  an  existing  i  ^  j  link;  he 
is  more  likely  to  link  to  an  actor  j'  for  which  link  i  — )■  j'  is  absent.  To  counteract  this  tendency, 
the  coefficient  of  Scycle  is  forced  to  be  negative. 


6.4.2.  Exponential  random  graph  model  analysis 

As  a  final  comparison,  we  fit  an  exponential  random  graph  model  to  our  data.  This  class  of 
models — one  of  the  more  popular  for  estimating  the  importance  of  network  effects — specihes  a 
probability  distribution  for  a  single  directed  graph  represented  by  a  binary  matrix  G.  It  supposes 
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that  PjG  =  g}  (X  Ts(g)},  where  Ts{g)  is  a  network  statistic,  for  example  the  number 

of  transitive  triples  in  the  graph.  (See  Anderson  et  al.  (1999)  for  a  detailed  survey.) 

To  apply  this  form  of  model,  we  employed  a  reduction  of  our  data  to  obtain  a  single  directed 
graph  G  as  follows.  Based  on  an  “elbow”  in  the  empirical  cumulative  distribution  function  of 
message  counts  Noo{i,j)  in  our  data,  we  chose  a  threshold  of  10  sent  messages  and  defined  G  by 

Gii,j)  =  l{Noo{i,j)  >  10}. 

Next,  as  in  our  comparison  to  the  actor-oriented  model,  we  chose  terms  in  the  model  to  mimic 
the  covariates  from  Section  5.2.  We  used  the  ergm  software  package  to  fit  the  model  (Handcock 
et  ah,  2011),  based  on  a  Markov  chain  Monte  Carlo  sample  size  of  10^  following  a  burn-in  period 
of  10®  iterates.  The  covariates  were  as  follows: 

(a)  Sender  effects  (sender).  One  covariate  is  included  for  each  sender,  measuring  the  outdegree 
of  the  sender.  The  corresponding  coefficient  plays  the  role  of  A*  (z)  in  our  model. 

(b)  Group-level  edge  effects  (edgecov).  One  covariate  is  included  for  each  identifiable  first- 
order  interaction  of  the  form  X{i)Y{j),  where  z  is  the  sender  and  j  is  the  receiver;  the 
covariate  counts  the  number  of  edges  i  —>■  j  with  X{i)Y{j)  =  1.  These  effects  correspond 
to  the  group-level  effects  in  our  model.  We  attempted  to  include  second-order  interactions 
as  well,  but  were  unable  (for  computational  reasons)  to  fit  the  model  with  these  terms. 

(c)  Mutuality  (mutual).  This  statistic  counts  the  number  of  mutual  ties;  i.e.,  edge  sets  of  the 
form  {i  — >  j,j  — >  z},  and  corresponds  to  the  receive  term  in  our  model. 

(d)  Cyclic  triples  (ctriple).  This  statistic  counts  the  number  of  cyclic  triples;  i.e.,  edge  sets 
of  the  form  {/z  — >  z,  z  — )■  j,j  — )■  h},  and  corresponds  to  the  2-receive  term  in  our  model. 

(e)  Transitive  triples  (ttriple).  This  statistic  counts  the  number  of  transitive  triples;  i.e., 
edge  sets  of  the  form  {h  — )•  z,  z  — )•  j,  /z  — )■  j}.  It  corresponds  to  the  2-send,  sibling,  and 
cosibling  terms  in  our  model. 

Note  that  reducing  the  interaction  data  to  a  single  directed  graph  has  important  modeling 
consequences.  As  with  the  case  of  the  snapshot-based  actor-oriented  model  detailed  above,  it  is 
impossible  to  separate  the  2-send,  sibling,  and  cosibling  effects,  and  the  inability  to  include 
second-order  interactions  between  group-level  effects  precludes  a  direct  comparison  with  the 
estimated  group-level  effects  from  the  proportional  intensity  model.  Further,  for  a  single  directed 
graph,  there  is  no  possibility  of  including  a  term  corresponding  to  send,  and  it  is  impossible  to 
quantify  the  time-dependence  of  the  network  effects. 

Figure  6.4.2  shows  the  fitted  coefficients  for  the  exponential  random  graph  model.  As  with 
the  case  of  the  actor-oriented  model  considered  in  Section  6.4.1  above,  the  estimated  network 
effects  agree  qualitatively  with  those  of  Fig.  9.  Specifically,  mutuality  and  transitive  triples  have 
positive  effects,  while  cyclic  triples  have  a  negligible  effect  (in  contrast  to  the  case  of  Fig.  12  from 
Section  6.4.1). 
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Fig.  13.  Estimated  coefficients  for  the  exponentiai  random  graph  modei  of  Section  6.4.2.  The  modei  aiso 
inciuded  a  term  for  each  sender  (not  shown);  furthermore,  standard  errors  returned  for  the  group-ievei 
edge  effects  were  nonsensicai,  and  so  are  not  reported. 

7.  Conclusion 

Our  analysis  of  the  Enron  corpus  in  Sections  5  and  6  above  has  demonstrated  the  ways  in 
which  static  and  dynamic  effects  manifest  themselves  in  e-mail  communication  networks,  and  we 
expect  similar  conclusions  to  hold  broadly  for  other  types  of  directed  interaction  data.  Relative  to 
alternatives  such  as  contingency  table  analyses,  actor-oriented  network  models,  and  exponential 
random  graph  models,  an  advantage  of  our  approach  lies  in  its  ability  to  model  the  given  data 
directly,  rather  than  in  an  aggregated  form.  We  are  able  to  adjust  for  network  effects  to  get 
more  reliable  estimates  of  homophily,  and  by  using  continuous-time  information  we  get  precise 
quantification  on  the  time-dependent  behavior  of  the  network  effects. 

The  foundation  of  our  work  is  Cox’s  (1972)  proportional  intensity  model  and  partial  likelihood 
theory,  tools  which  he  first  introduced  almost  forty  years  ago  and  which  have  been  significantly 
developed  since  then  (Cox,  1975;  Fleming  and  Harrington,  1991;  Andersen  et  ah,  1993;  Mar- 
tinussen  and  Scheike,  2006;  Cook  and  Lawless,  2007).  These  tools  are  used  extensively  in  the 
context  of  survival  analysis,  but  require  further  development  for  use  in  modeling  interaction  data. 
In  this  vein,  we  have  extended  the  associated  theory  in  two  directions:  first,  we  have  provided 
results  that  are  asymptotic  in  time  rather  than  in  the  size  of  the  population  under  study;  and 
second,  we  have  shown  that  treating  multicast  interactions  via  duplication  leads  to  bias  in  the 
parameter  estimates  (which  can  in  turn  be  corrected  in  certain  regimes). 

We  find  the  proportional  intensity  model  with  time- varying  covariates  to  be  particularly  useful 
for  modeling  repeated  directed  interactions.  The  model  is  simple,  flexible,  and  well  established, 
and  it  facilitates  investigation  into  which  traits  and  behaviors  are  predictive  of  interaction. 
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A.  Implementation 


To  compute  the  maximum  partial  likelihood  estimator,  we  use  Newton’s  method  as  described  in 
Boyd  and  Vandenberghe  (2004).  This  requires  an  efficient  algorithm  for  computing  the  gradient 
and  Hessian  of  the  log  partial  likelihood.  For  simplicity,  we  describe  the  case  of  strictly  pairwise 
interactions  with  no  ties  in  the  interaction  times.  We  use  the  notation  from  Section  2,  with  the 
model  from  (1)  and  the  partial  likelihood  from  (2).  Recall  that  Xt{i,j)  is  in  Rp.  Assume  that 
\T\  =  I  and  \  J\  =  J. 

Suppose  (ti,  A,  ji), . . . ,  (t„,  in,jn)  is  the  sequence  of  observed  interactions.  Set  n{i)  =  '■ 

im  =  i}-  The  partial  likelihood  factors  into  a  product  of  terms,  one  for  each  sender: 


PLt{p)=  n  PLt{p,i), 

iei 


PLt{(3,i) 


t. 


TT 


This  factorization  allows  us  to  compute  log  PL*  (/3)  and  its  derivatives  by  computing  the  sender- 
specific  terms  in  parallel  and  then  adding  them  together. 

The  gradient  and  Hessian  of  the  sender-specific  log  partial  likelihood  are  respectively 

V[logPLt(/3,i)]  =  ^  xtji.jm)  -  (15a) 

t^<t,  t^<t, 

im—i  'i-m—i 

-V2[logPLi(/3,i)]=  ^  Ht„(/?,^),  (15b) 

tm<t, 


where  Et{j3,i)  and  Vt{fi,i)  are  as  dehned  in  (5a)  and  (5b).  When  Xt{i,j)  is  constant  over 
time,  sufficient  statistics  for  /3  imply  that  these  formulae  simplify.  Otherwise,  computing  the 
first  two  derivatives  of  log  (/3)  necessitates  iterating  over  all  messages,  potentially  requiring 
time  0{n  J p^).  For  small-  to  medium-sized  datasets,  this  is  manageable,  but  for  large  network 
datasets  it  can  become  prohibitive.  In  the  seqnel  we  show  how  to  exploit  sparsity  to  drastically 
reduce  the  computation  time. 


A.1.  Initial  values 

We  will  need  to  compute  Wo(/3,i),  Eo{l3,i),  and  Vo(/3,i)  for  all  values  of  i  and  j. 

In  the  worst  case,  doing  so  will  take  0{I  J p^).  However,  often  the  senders  belong  to  a  small 
number,  /  <C  /  of  groups  such  that  if  i  and  i'  are  in  the  same  group,  then  the  corresponding 
values  of  Wq,  tto,  Eq,  and  Vq  are  the  same,  reducing  the  total  complexity  to  0{IJp^).  The 
remaining  complexity  estimates  assume  that  the  initial  values  have  all  been  pre-computed. 

A.2.  Exploiting  sparsity 

We  first  decompose  x  into  its  static  (non-time-varying)  and  dynamic  parts  as  follows: 

=  xo{i,j)  -k  Axt{i,j).  (16) 

Typically,  we  can  quickly  compute  the  dynamic  part  Axt{i,j)  at  each  observed  message  time  by 
incrementally  updating  it.  Further,  Axt{i,j),  is  zero  for  most  (z,  j)  pairs — often  Axt{i,j)  is  zero 
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unless  i  and  j  have  a  common  acquaintance  or  they  have  interacted  in  the  past.  For  convenience, 
set  Jo{i)  =  J .  Let 

=  {j  &  J  '■  j  €  Jt{i)  and  Axt{i,j)  ^  0  for  some  t  }  U  {j  €  J'  :  j  ^  Jt(i)  for  some  t  }. 


For  fixed  t  and  assume  that  computing  Axt{i,  j)  for  all  values  of  j  takes  amortized  time  0{dJ). 
Since  Jo(*)  =  J ■,  we  have  that 


wt{l3,i,j)  =wo{P,i,j)  ■  exp{l3'^ Axt{i,  j)}  ■  l{j  € 
=  wo{l3,i,j)  +  Awt{i,j); 

VFt(/3,z)  =  lFo(/3,i)  +  ^  AwtiiJ); 

j&Jii) 


where 


Awt{i,j)  =  wo{l5,i,j)[exy,{P^ Axt{i,j)}l{j  G  -  1]; 
here  we  have  used  that  Awt{i,j)  is  zero  unless  j  G  J{i)-  Write 

wtil3,i,j) 


Wt{P,i) 


then,  defining 


lt{i)  = 


Wo{P,i) 


we  can  express  7rt(/3,z,j)  as  follows: 


A7rt(/3,i,j) 


Awt(/3,z,  j) 


=  lt{i)Tro{l3,i,j)  +  AntiPAJ)- 


Moreover,  given  the  initial  values  Wo(/3,z)  and  wo{l3,i,j),  we  can  efficiently  keep  track  of  7t(z) 
and  ATrt{l3,i,  j):  for  any  z  and  t,  it  takes  amortized  time  0{Jdp)  to  evaluate  7z(z)  and  all  values 
of  At: as  j  varies. 


A. 3.  Computing  the  gradient 

In  evaluating  the  gradient  of  the  log  partial  likelihood  as  given  by  (15a),  the  sum  jm) 

can  be  computed  in  time  0{np),  while  the  computationally  expensive  term  is  ^tm.  (i®)  *m)' 
the  sequel  we  show  how  to  exploit  sparsity  in  x  to  reduce  the  associated  computational  overhead. 

To  simplify  the  notation,  we  suppress  the  dependence  of  all  quantities  on  /?  and  z.  Consider 
TTt  and  Att;  to  be  vectors  of  length  J,  and  write 


fo  =  7t7’‘o  +  Att^. 


Also,  let  Xt  =  Xt{i)  and  AA*  =  AXt{i)  be  the  J  x  p  matrices  whose  jth  rows  are  Xt{i,j)  and 
Axt{i,j),  respectively,  so  that 

W  =  Ao  +  AAj. 

Using  these  expressions,  we  obtain 


Et  —  Xj  TTt  —  7t  Aq  +  Xj  Aiit  +  AAj’"  TTt, 
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and  thus, 

E  =  ( E  E  +  E 

mm  mm 

ijn—i  im—'i' 

Taking  advantage  of  the  sparsity  in  and  Att^,  computing  the  three  sums  on  the  right 

hand  side  takes  time  0{n{i)  J  dp).  Once  the  sums  are  known,  the  multiplication 

takes  time  0{p),  and  the  multiplication  Xq  ( ^  Att^^)  takes  time  0{J p).  Thus,  we  can  compute 
^ .  m  Et.^  in  time  0(n{i)  J  dp^.  Computing  these  terms  separately  for  each  i  and  then  summing 
over  all  i  to  get  the  total  gradient  requires  time  0{n  J  dp  +  I p). 

A. 4.  Computing  the  Hessian 

Computing  the  Hessian  according  to  (15b)  proceeds  similarly  to  the  case  of  the  gradient.  We 
need  to  efficiently  compute  the  sum  (/3)  *m);  while  a  naive  computation  requires  time 

0{nJp^),  this  can  be  signihcantly  improved  by  exploiting  sparsity  in  Xt{i,j). 

To  this  end,  dehne  ni(/3,  i)  to  be  the  J  x  J  diagonal  matrix  with  [nt(/?,  i)]jj  =  TTtiP,  i,j),  and 
set  AIlt{l3,i)  =  !!*(/?)*)  ~  no(/?)0-  Suppressing  the  dependence  on  /3  and  i,  we  have 

V)  =  Xj[n^  -  ^trrJ]Xt 

=  X^[IVt-ntnJ]X^  +  AXj[I].t-rTtTTj]Xo 

+  Xj[Ut  -  TTtTrJ]AXt  +  AA^pi  -  TTtTrJ]AXt. 

The  first  of  these  terms  reduces  to 

X^[I{t-TTt^J]Xo  =  itV^  +  7t(l-7t)i;oSj  -  E^i-itAiTt)'^ X^ 

-  Xo{itAnt)E'^  +  Ao^iAHt- A^iA^T]Ao, 

and  the  second  can  be  expressed  as 

AX]^[nt-TTtTTj]X^  =  {-ftAXtrTt)E^  +  AA^pt  +  Att^JAc- 

The  third  term  is  the  transpose  of  the  second;  the  fourth  does  not  simplify. 

To  compute  the  sum  only  accumulate  sums  of  terms  that  change  with  time: 

7t,  AiTt,  7t(l  -  7t),  7tA^t,  AirtAnJ,  jtAXtTTt,  AAj"'"[nt  +  TCtATr^'],  and  AA^’p*  -  jrtTrJjAXt. 
Doing  so  takes  time  0{J dp^)  for  each  time  increment.  As  with  the  gradient  computation,  we 
compute  the  sums  separately  for  each  i  and  then  sum  over  all  i,  so  that  the  total  computation 
time  is  0{nJ  dp^  +  I p^). 

A.5.  Total  computation  time 

To  perform  one  Newton  step  in  maximization  of  the  log  partial  likelihood  of  (2),  we  must  first 
compute  the  gradient  and  Hessian  of  the  log  partial  likelihood  at  the  current  value  of  /?,  and  then 
compute  the  inverse  of  the  Hessian  and  its  product  with  the  gradient.  Once  we  have  the  Hessian, 
computing  its  inverse  takes  time  0{p^).  Typically,  it  takes  0(1)  Newton  steps  to  compute  the 
maximum  of  a  convex  function  (the  constant  is  often  below  30).  The  key  factors  in  determining 
the  computation  time  using  the  factors  laid  out  above  are  I,  J,  and  d: 

•  The  value  of  I  depends  on  the  structure  of  xo{i,j).  Specihcally,  I  is  equal  to  the  number 
of  distinct  values  of  the  matrix  Ao(i)  as  i  varies.  For  the  Enron  data,  we  have  that  I  =  12: 
each  sender  belongs  to  one  of  12  groups  determined  by  group  (L/T/0),  seniority  (J/S), 
and  gender  (F/M),  and  so  the  matrix  Xo{i)  depends  only  on  the  group  of  i. 
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•  The  value  of  J  depends  on  the  sparsity  of  Xt{i,j).  If  Xt{i,j)  includes  only  dyadic  network 
effects,  then  J  will  typically  be  of  size  0(1)  or  0{J°‘)  for  a  fractional  value  a;  when  we 
add  triadic  effects,  this  size  will  typically  grow  to  at  most  0(J^“). 

•  The  value  of  d  depends  on  further  structure  in  xt{i,j).  In  our  implementation,  d  =  0(1) 
for  dyadic  effects  and  d  =  0(J)  for  triadic  effects. 


The  total  computational  cost  per  Newton  step  is  thus  0{I  J +  n  J  dp^  + 1 +P^),  with  the 
significance  of  this  expression  being  that  it  is  nearly  linear  in  I,  J ,  and  n.  Thus,  the  algorithm 
scales  naturally  to  large  datasets. 


B.  Results  from  Section  3 

B.  1.  Lemmas  from  the  proof  of  Theorem  3. 1 

Lemma  B.l.  Using  the  notation  of  Theorem  3.1,  under  assumption  Al  the  process 
from  (7)  is  a  square-integrable  martingale  adapted  to  . 

Proof.  The  conditional  expectation  property  holds  provided  E[C/t^(/3o)  |  =  Oi^_^(/3o). 

Define  K  =  sup^  j^-  ||a;t(i,  j)||.  Note  that  \\Ht{i,j)\\  <  2K.  Thus, 

||OtAt„(/3o)||  < 

so  that 


E 


sup||C/tAt„(/3o)f 


By  assumption  Al,  is  finite,  and  by  construction,  Nt^  is  bounded.  Since  NtAtr^  is  a  counting 
process,  EA?„  is  finite,  too  (this  follows  from  results  in  Section  2.3  of  Fleming  and  Harrington 
(1991)).  Thus,  C/tAt„(/3o)  is  uniformly  integrable.  The  Optional  Sampling  Theorem  now  ap¬ 
plies  to  give  the  conditional  expectation  property  of  f7^”i(/3o)-  For  square  integrability,  note 
supi<„<„E||[/t„||2  <  E  [supt  \\UtAt„WoW]  ■ 


Lemma  B.2.  Using  the  notation  of  Theorem  3.1,  under  assumption  Al,  the  Lindeberg  con¬ 
dition  for  Rebolledo’s  (1980)  Central  Limit  Theorem  is  satisfied:  for  any  positive  s. 


1 

n 


E 


\\Hsii,j)f  l{||i7s(Li)ll  >  Vns}dAsii,j)  4  0. 


Proof.  With  K  =  sup^  ||a;t(i,j)||  as  above,  the  integral  is  bounded  by  l{n  > 

e/2}  •  Since  EiL^  <  oo  by  assumption  Al,  the  first  term  converges  to  zero  in  probability. 
Since  EAt^  =  EA^j^  =  n,  the  product  of  the  two  also  converges  to  zero  in  probability.  Thus,  the 
Lindeberg  condition  is  satisfied. 


Lemma  B.3.  Using  the  notation  of  Theorem  3.1,  under  assumptions  Al  and  A3  we  have  that 


1 


E 


qc„j 


VAPou)  dM,{i) 


uniformly  in  a. 
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Proof.  Lenglart’s  (1977)  Inequality  and  assumption  A3  imply  that  for  any  positive  p  and 


<5, 


sup 


Vs{l3o,i)  dMsii) 


||y,(/3o,^)f  dA,(i) 


(see  Fleming  and  Harrington  (1991,  Cor.  3.4.1)  for  a  related  proof).  As  in  the  proof  of  Lemma  B.l, 
set  K  =  supj  ||a;t(z,  j)||.  The  sum  is  bounded  by  Since  0  by  assumption 

A1  and  EA^^  =  n,  the  right-hand  side  of  the  inequality  converges  to  Since  <5  is  arbitrary,  the 
right-hand  side  must  converge  to  zero. 


B.2.  Proof  of  Theorem  3.2 

We  follow  Haberman’s  (1977)  approach  to  proving  consistency,  which  relies  on  Kantorovich’s 
(1948)  analysis  of  Newton’s  method.  Tapia  (1971)  gives  an  elementary  proof  of  the  Kantorovich 
Theorem.  We  state  a  weak  form  of  the  result  as  a  lemma. 

Lemma  B.4  (Kantorovich  Theorem).  Let  P{x)  =  0  be  a  general  system  of  nonlinear 
equations,  where  P  is  a  map  between  two  Banach  spaces.  Let  P' (x)  denote  the  Jacobian  (Frechet 
differential)  of  P  at  x,  assumed  to  exist  in  Dq,  a  convex  open  neighborhood  ofxg.  Assume  the 
following: 

(a)  \\[P'ixo)]-^<B, 

(b)  \\[P'{xo)]-^Pixo)\\<p, 

(c)  \\P'{x)  -  P'{y)\\  <  K\\x  -  y\\,  for  all  x  and  y  in  Dq, 
with  h  =  BKr]  < 

Let  fl*  =  {x  :  ||a;  —  a;o||  <  2?7}.  If  C  Dq,  then  the  Newton  iterates,  Xk+i  =Xk  — 
[P'{xk)]~^P{xk),  are  well  defined,  remain  in  fl*,  and  converge  to  x*  in  O*  .such  that  P{x*)  =  0. 
In  addition, 

\\x* -Xk\\<l^^^,  fc  =  0,l,2,.... 

Proof  (Theorem  3.2) .  Set  Utf)  and  /((•)  to  be  the  gradient  and  negative  Hessian  of  the  log 
partial  likelihood,  as  defined  in  (5a-5b).  Since  /*(/?)  is  a  sum  of  rank-one  matrices  with  positive 
weights,  it  is  positive  semi-definite,  and  logPLt(-)  is  a  concave  function.  By  the  assumption 
that  the  smallest  eigenvalue  of  Si(-)  is  bounded  away  from  zero  in  a  neighborhood  of  /3o,  for  n 
sufficiently  large,  if  logPLt(-)  has  a  local  maximum  in  that  neighborhood  then  it  must  be  the 
unique  global  maximum. 

We  find  the  local  maximum  by  applying  Newton’s  method  to  the  gradient  of  ^  log  PLt^f), 
taking  fd^  as  the  initial  iterate.  Define 

The  first  Newton  iterate,  jdn.i,  is  equal  to  /3o  —  Zn.  Part  (b)  of  Theorem  3.1  and  the  assumptions 
of  the  theorem  imply  (/3o)]“^  exists  for  n  large  enough,  so  that  is  well-dehned.  Moreover, 

Part  (a)  of  Theorem  3.1  and  Slutsky’s  Theorem  imply  -4  0  and  yNiZn  A/’(0,  [Si(/lo)]“^). 

Apply  Kantorovich’s  Theorem  to  bound  ||/3n  —  /3o||  and  ||/3ra  —  /3„p||.  By  assumption,  there 
exists  a  neighborhood  of  /3o,  say  Dq,  and  finite  K  and  B,  such  that  ||^/t„(/3)  —  < 

K\\f3  -  j3'\\  and  ||^[/t„(/3o)]“^||  <  B  for  j3,fd'  €  Dq.  Define  =  \\Z„\\  and  hn  =  BKrin,  noting 
that  hn  and  ?7„  are  size  Thus,  for  n  large  enough. 
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(s)  WPn  —  /3o||  <  2?7„  — >  0, 

(b)  ^/n\\Pn-  iPo-  Zri)\\  <‘2y/nrinhri-^  0- 
Thus,  /3„  — )■  /3o,  and  •\/n(/?„  —  /3o)  and  y/nZn  converge  weakly  to  the  same  limit. 


C.  Results  from  Section  4 

Lemma  C.l.  Using  the  notation  and  assumptions  of  Theorem  4-1, 

exp{4iL  11/311} 


-  t,/3,i-L 


h,  ■  ■  •  Jl)  7^  (Ji,  •  •  ■  ,Jl)  }  < 


2J  |J*(z)|  ’ 

where  K  =  sup^  ||a:t(i, })|| . 

Proof. 

,  JZ,)  7^  (Ji,  ■  ■  •  Ji)}  =  ]Pt,/3.i;L{ji.  ■  •  •  Ji  has  a  duplicate} 

wt{/3,i,j)  12 


k<l 

L 


E 


Note  exp{— iL  ||/3||}  <  wt{P,i,j)  <  exp{iL|| /3||},  so  that 


E 

j&JtU) 


wt{l3,i,j)  ]2  ^  exp{4iL||/3||} 


Lemma  C.2.  Using  the  notation  and  assumptions  of  Theorem  4-1, 

V2[lograt(/3)]  -V2[logPLt(/3)]  <2iL2exp{4iL||/3||}.  J]  ^ 

i„<t 

Proof.  The  argument  is  similar  to  the  bound  on  the  difference  in  gradients  in  the  proof  of 
Theorem  4.1.  The  Hessians  of  the  weights  are 


y*(/3,i;L)  =  V2[logTTt(/3,i;L)] 


pp  E  Wt{l3,i,J)  Xt{i,J)- Et{P,i;L) 

\J\=L 


02 


Vt{l3,i;  L)  =  S/^  [log  Wt{l3,i;L)] 


=  L- 


T.j(^jpi)Wt{P,i,j)  xt{i,j)  -  ^Et{P,i]L) 


1  02 


The  first  is  the  covariance  matrix  of  under  Vt,i3,i-,L',  the  second  is  the  covariance 

matrix  of  the  same  quantity  under  The  result  follows  in  the  same  manner  as  in  the  proof 

of  Theorem  4.1.  The  relevant  intermediate  bound  is 


VtiP,i;L)-VtiP,i;L)  <2K^L^L-1) 


exp{4jL||/3||} 
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Proof  (Theorem  4.2).  We  know  that  Newton’s  method  applied  to  ^  log  PLt^{-)  converges 


to  Pn  after  sufficiently  many  iterations.  We  employ  /3„  as  the  initial  iterate  and  use  the  Kan¬ 
torovich  Theorem  (Lemma  B.4)  to  bound  ||/3„  —  ^„||. 

In  the  notation  of  the  lemma,  P{-)  is  the  gradient  of  ^\ogPLt^{-)  and  P' {■)  is  its  Hessian. 
The  conditions  of  Theorem  4.2  imply  assumptions  (a)  and  (c)  hold  uniformly  in  n  for  some  finite 
B  and  K.  Set 


lln 


V2[ilogPT*„(/3„); 


logPLt„(/3„) 


and  set  =  BKrjn.  Since  V  [  log  (,0„)]  =  0,  Theorem  4.1  and  the  boundedness  of  the 
inverse  Hessian  imply  ijn  =  Op^Gn/n).  Therefore,  for  n  large  enough, 


IIA 


< 


rjn 

h  20 


2?7„  =  e>p(G„/n). 
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